ABSTRACT


A  model  is  proposed  for  learning  the  nature  and  value  of  an  unknown 
parameter,  or  unknown  parameters,  in  a  probability  distribution  which 
forms  part  of  a  body  of  statistics  related  to  some  system  or  process. 

The  model  is  Bayesian,  involving  the  assumption  of  an  a  priori 
probability  distribution  over  the  possible  values  of  the  unknown 
parameters;  the  performance  of  experiments  to  gain  information  about 
the  parameters;  and  the  alteration  of  the  a  priori  probabilities  by 
Bayes'  rule.  In  the  limit,  as  the  number  of  experiments  approaches 
infinity,  the  a  posteriori  distribution  in  most  cases  encountered  in 
practice  approaches  a  delta  function  at  the  true  values  of  the  unknown 
parameters,  so  the  system  learns  the  values  of  the  parameters  exactly. 
The  learning  process  developed  in  the  paper  is  shown  to  be  technically 
feasible  if  the  a  priori  and  a  posteriori  distributions  are  of  the 
same  form,  with  the  learning  accomplished  by  calculating  new  parameters 
for  these  distributions.  It  is  shown  that  a  necessary  and  sufficient 
condition  for  fulfillment  of  this  feasibility  criterion  is  for  a 
sufficient  statistic  of  fixed  dimension  to  exist.  If  such  a  sufficient 
statistic  exists,  the  a  posteriori  distributions  may  vary  in  form 
initially,  but  they  eventually  become  of  fixed  form.  The  techniques 
developed  indicate  logical  methods  for  choosing  a  priori  probabilities 
and  are  applied  in  pattern  recognition,  estimation,  and  other  problems. 
I.  INTRODUCTION


A.  PURPOSE 

The  purpose  of  the  study  described  in  this  paper  is  to  develop  a 
model  for  a  learning  technique  capable  of  utilizing  and  evaluating 
statistical  information  relating  to  a  physical  system  or  process.  The 
model  is  to  be  applicable  in  situations  where  the  form  of  the  probability 
distributions  describing  a  process  is  known,  but  where  the  values  of  some 
of  the  parameters  involved  in  these  distributions  are  unknown.  The 
model  is  to  be  readily  adaptable  to  construction  of  an  actual  learning 
machine  or  to  simulation  of  such  a  machine  on  a  digital  computer. 

It  is  expected  that  the  results  of  the  study  will  be  useful  in  the 
design  of  complex  multiple-element  systems,  including  a  variety  of 
different  types  of  communication  systems. 

B.  BACKGROUND 

Since the pioneering work of Shannon and Wiener in 1948-49 [Refs.
1-4], a large amount of research has been done on application of statistical
techniques to design of communication systems. This research has
been  motivated  by  the  realization  that  often  only  an  approximate  estimate 
of  the  conditions  under  which  a  communication  system  will  be  required  to 
operate  is  available.  Under  these  circumstances,  designing  the  system 
so  that  its  performance  will  be  the  best  possible  on  the  average  appears 
more  reasonable  than  attempting  to  optimize  performance  under  specific 
conditions  which  may  later  turn  out  to  be  inapplicable. 

To achieve the best possible average system performance, statistical
techniques are applied. A specific criterion for judging system performance
is defined; then the techniques of probability theory are utilized
to see how well this criterion may be expected to be satisfied. Stating
the matter in more mathematical language, excellence of system performance
is judged by the statistical expectation of a random variable Z
which represents the selected performance criterion. In some cases Z
is a squared error term, in which case its statistical expectation
E[Z] is the mean squared error; in other cases Z is the fraction of
the time when a system makes an error, with E[Z] the probability of
error.

Although the mathematics involved are often complex, the application
of the statistical criteria is in principle straightforward, provided
a body of statistics relating to the problem is available. The
statistics  can  often  be  computed  through  a  knowledge  of  the  physical 
principles  involved,  or  can  be  estimated  accurately  from  experience. 

In some cases, however, the statistics are not accurately known and
must be further investigated before any criteria or statistical expectations
thereof can be established. This fact is responsible for much
of  the  current  emphasis  on  research  in  learning  techniques. 

In  connection  with  a  body  of  statistics,  a  learning  technique  may 
be  defined  as  a  procedure  for  evaluating  experimental  observations  in 
order  to  gain  information  about  parameters  involved  in  the  system  or 
process to which the statistics apply. Throughout this report the term
learning will be used in the restricted sense suggested by this definition,
and only in this restricted sense. In view of the large amount
of  research  currently  being  done  on  learning  in  biological  systems,  it 
should  be  pointed  out  that  learning  in  the  sense  in  which  the  term  is 
used  here  may  bear  little  resemblance  to  learning  performed  by 
biological  systems. 

C.  METHOD OF APPROACH

In this investigation a possible model for the process of learning
the values of unknown parameters in a body of statistics is developed.
Although the proposed model is not the most general possible, it is
general  enough  for  most  practical  purposes.  One  important  kind  of 
a  priori  information  is  postulated:  it  is  assumed  that  the  forms  of 
the  probability  distributions  involved  in  the  statistics  are  known, 
although  some  of  the  parameters  of  these  distributions  are  unknown. 

This  assumption  is  interpreted  to  mean  that  the  physical  process 
involved  is  known  well  enough  to  identify  the  type  of  probability 
density  being  dealt  with,  but  not  well  enough  to  permit  computation  of 
all the parameters for this density. This is a situation often occurring
in  practice;  for  example,  it  might  be  known  that  a  probability  density 
was  multivariate  Gaussian,  but  the  mean  vector  or  covariance  matrix  for 
this  Gaussian  density  might  not  be  known. 

As a basic procedure it is assumed that the symbol θ represents
some unknown parameter or parameters in one of the known probability
densities. In order that the statistical expectation E[Z] can be computed,
θ is treated as a random variable and an a priori probability
density p(θ) is assumed over the range of its possible values.* The
expectation E[Z] is then determined from the standard statistical
equation

$$E[Z] = \int E[Z \mid \theta]\, p(\theta)\, d\theta \tag{1}$$

The learning model developed in this investigation is based on a
series of modifications of Eq. (1). These modifications will be discussed
in the next chapter.

* This so-called "Bayesian" technique of treating a fixed but unknown
parameter as a random variable is common engineering practice, though
frowned on by many statisticians. Even in statistical circles, however,
the practice appears to be gaining wider acceptance [Refs. 5 and 6].
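
As an illustration of Eq. (1), the sketch below approximates the integral with a midpoint rule. It is not part of the original report; the function name `expectation`, the uniform prior, and the choice E[Z|θ] = θ are assumptions made only for the example.

```python
def expectation(cond_exp, prior, lo, hi, n=10_000):
    """Midpoint-rule approximation of E[Z] = integral of E[Z|theta] p(theta)
    over [lo, hi], i.e. Eq. (1) for a scalar parameter theta."""
    h = (hi - lo) / n
    total = 0.0
    for k in range(n):
        theta = lo + (k + 0.5) * h
        total += cond_exp(theta) * prior(theta)
    return total * h

# Hypothetical case: theta is an error probability in [0, 1], Z is the
# fraction of errors (so E[Z|theta] = theta), and the prior is uniform.
e_z = expectation(lambda t: t, lambda t: 1.0, 0.0, 1.0)
print(round(e_z, 6))  # 0.5
```

With a uniform prior the result is simply the midpoint of the range; a prior concentrated near small θ would give a correspondingly smaller E[Z].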
II.  THE LEARNING MODEL


A.  BASIC  EQUATION 

It has been shown that, for a body of statistics related to some
physical process or system,

$$E[Z] = \int E[Z \mid \theta]\, p(\theta)\, d\theta \tag{1}$$

where:

θ = an unknown parameter or parameters in the probability
    distributions included in the statistics

Z = a random variable representing a selected performance
    criterion

E[Z] = the statistical expectation of Z

p(θ) = the a priori probability density function of θ
    [p(θ) or some information which may be utilized in
    choosing p(θ) is assumed to be known a priori*]

E[Z|θ] = the conditional expectation of Z given θ
    (the expectation of Z is assumed to be known
    a priori as a function of θ; for any specific
    value of θ, E[Z|θ] is the value that would be
    used for E[Z] if θ were known to have the
    postulated value).

* One of the results of this investigation is to indicate ways of
choosing p(θ) when this density is only approximately known.

In this investigation Eq. (1) is to be used as the basis for a
learning model; however, modification of Eq. (1) is suggested by the
fact that, if the value of θ were known more accurately, more confidence
could be placed in the value of E[Z].

B.  LEARNING OBSERVATIONS

The obvious way to improve the extent of knowledge about θ is to
perform an experiment, or a set of experiments, to gain information about
the parameters. Let the set of outcomes of some such set of learning
observations be designated by Λ₁. Λ₁ cannot be expected to tell
exactly what the value of θ is, since it has been assumed that θ
cannot be measured accurately; however, it is assumed that the probability
density function of the learning observations is known as a
function of θ. If the probability density function of the learning
observations were not known, or if it were not a function of θ, there
would be little to gain from performing the experiments. The probability
density function of the learning observations is denoted by p(Λ₁|θ).*

In the present study it is also assumed that E[Z|θ] is independent
of Λ₁. This may be interpreted as an assumption that Λ₁ is used only
to improve the extent of knowledge about θ and does not influence the
values of θ. (An example of an equivalent assumption is the assumption
that inserting an ammeter in an electric circuit to measure the current
does not change the magnitude of the current; any other assumption that
the measurement of a quantity does not influence the magnitude of that
quantity is also equivalent.)

C.  MORE  ACCURATE  VERSION  OF  STATISTICAL  EXPECTATION 

The information is now available to compute a more accurate version
of E[Z]. First, Bayes' rule** is applied to obtain

$$p(\theta \mid \Lambda_1) = \frac{p(\Lambda_1 \mid \theta)\, p(\theta)}{\int p(\Lambda_1 \mid \theta)\, p(\theta)\, d\theta} \tag{2}$$

where

Λ₁ = the outcomes of a set of learning observations used to
    gain information about θ

p(Λ₁|θ) = probability density function of the learning observations
    (when treated as a function of Λ₁ for fixed θ)
  = likelihood function of θ (when treated as a function
    of θ for fixed Λ₁)
    (this quantity is assumed to be known as a function of
    both Λ₁ and θ; it is used as a likelihood function
    in Eq. (2))

p(θ) = a priori probability density function of θ

p(θ|Λ₁) = a posteriori probability density function of θ
    (this function is assumed to be evaluated in the
    light of Λ₁ by Eq. (2))

* A quantity of the form of p(Λ₁|θ), when treated as a function of
θ, is often called a "likelihood" rather than a "probability density."
As a function of Λ₁ for fixed θ, p(Λ₁|θ) has been defined to be
a probability density. As a function of θ for fixed Λ₁, however,
p(Λ₁|θ) is not a true probability density; although it satisfies one
of the requirements for a probability density by being non-negative,
it does not normally satisfy the requirement of integrating to one.
In the subsequent discussion the term "probability density" will be
used when quantities of the form of p(Λ₁|θ) are considered as functions
of observations, while the term "likelihood" will be used when
such quantities are considered as functions of θ.

** Bayes' rule is the standard equation for computing conditional
probabilities. It may be found in any textbook on probability theory.
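
A minimal numerical sketch of Eq. (2) follows, assuming θ is a scalar restricted to a finite grid so that the integral in the denominator becomes a sum. The names `bayes_update` and `expect`, the binomial likelihood, and the uniform prior are illustrative assumptions, not part of the original report.

```python
from math import comb

def bayes_update(grid, prior, likelihood):
    """Eq. (2) on a grid: posterior mass is likelihood * prior, normalized."""
    unnorm = [likelihood(t) * p for t, p in zip(grid, prior)]
    z = sum(unnorm)  # discrete stand-in for the integral in the denominator
    return [u / z for u in unnorm]

def expect(grid, weights, cond_exp):
    """Expectation of E[Z|theta] under the given probability masses."""
    return sum(cond_exp(t) * w for t, w in zip(grid, weights))

# Hypothetical setting: theta is a success probability and the learning
# observations are 7 successes in 10 trials.
n = 1000
grid = [(k + 0.5) / n for k in range(n)]
prior = [1.0 / n] * n  # uniform prior, expressed as probability masses
lik = lambda t: comb(10, 7) * t**7 * (1 - t)**3
post = bayes_update(grid, prior, lik)
print(round(expect(grid, post, lambda t: t), 3))
```

The printed posterior mean should lie close to 8/12, the mean of the exact Beta(8, 4) posterior for this setting. Note that `lik` is used here as a likelihood: it is evaluated over θ for fixed observations and need not sum to one over the grid.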


The new expectation for Z is then calculated as

$$E[Z \mid \Lambda_1] = \int E[Z \mid \theta]\, p(\theta \mid \Lambda_1)\, d\theta \tag{3}$$

where

E[Z|Λ₁] = the statistical expectation of Z incorporating the
    information gained from the observations Λ₁
  = the conditional expectation of Z given the observations Λ₁

E[Z|θ] = the conditional expectation of Z given θ, expressed
    as a function of θ, and assumed independent of Λ₁.

This calculation completes one stage of the learning process. A
more accurate version of E[Z] has been obtained, but it may be desired
to obtain a still more accurate version. This even more accurate version
can be obtained by repeating the previous process. Another set, Λ₂,
of learning observations is taken; p(θ|Λ₁,Λ₂) is computed by Bayes'
rule; and this density is used to compute E[Z|Λ₁,Λ₂]. Then a third
set, Λ₃, of learning observations is taken and the process is
repeated. The progressively developing results of the learning process
can be expressed in terms of the three sequences:

$$\{\;\} \to \{\Lambda_1\} \to \{\Lambda_1, \Lambda_2\} \to \text{etc.}; \tag{4a}$$

$$p(\theta) \to p(\theta \mid \Lambda_1) \to p(\theta \mid \Lambda_1, \Lambda_2) \to \text{etc.}; \tag{4b}$$

$$E[Z] \to E[Z \mid \Lambda_1] \to E[Z \mid \Lambda_1, \Lambda_2] \to \text{etc.} \tag{4c}$$
In the most general case a model for the learning process can become
complex. The computations to be performed at any time may depend on the
entire set of prior observations, as is shown by the general form of
Bayes' rule

$$p(\theta \mid \Lambda_1, \ldots, \Lambda_n) = \frac{p(\Lambda_n \mid \theta, \Lambda_1, \ldots, \Lambda_{n-1})\, p(\theta \mid \Lambda_1, \ldots, \Lambda_{n-1})}{\int p(\Lambda_n \mid \theta, \Lambda_1, \ldots, \Lambda_{n-1})\, p(\theta \mid \Lambda_1, \ldots, \Lambda_{n-1})\, d\theta} \tag{5}$$

Equation (5) indicates how the new probability density for θ can
be computed from the old density; but the computation requires that the
probability of Λₙ be known as a function of θ and of all the previous
observations, i.e., as p(Λₙ|θ, Λ₁, ... Λₙ₋₁). It is often possible to
simplify this computation, however. If it is assumed that the different
sets of learning observations are conditionally independent (of each
other) given θ,* Eq. (5) can be simplified to

$$p(\theta \mid \Lambda_1, \ldots, \Lambda_n) = \frac{p(\Lambda_n \mid \theta)\, p(\theta \mid \Lambda_1, \ldots, \Lambda_{n-1})}{\int p(\Lambda_n \mid \theta)\, p(\theta \mid \Lambda_1, \ldots, \Lambda_{n-1})\, d\theta} \tag{6}$$

* Under conditional independence, for any two sets of learning
observations Λᵢ and Λⱼ,

$$p(\Lambda_i, \Lambda_j) = \int p(\Lambda_i, \Lambda_j \mid \theta)\, p(\theta)\, d\theta = \int p(\Lambda_i \mid \theta)\, p(\Lambda_j \mid \theta)\, p(\theta)\, d\theta$$

while

$$p(\Lambda_i)\, p(\Lambda_j) = \iint p(\Lambda_i \mid \theta)\, p(\theta)\, p(\Lambda_j \mid \theta')\, p(\theta')\, d\theta\, d\theta'$$

Comparing these two equations it is seen that in general

$$p(\Lambda_i, \Lambda_j) \neq p(\Lambda_i)\, p(\Lambda_j)$$

If p(θ) is a delta function, however, the inequality becomes an
equality. Thus, this conditional-independence assumption may be interpreted
as an assumption that, if θ were known, the Λᵢ would be
statistically independent of each other. With θ unknown, however,
the statistical dependence of each Λᵢ on θ introduces interdependence
among the Λᵢ themselves. This interdependence among the Λᵢ
makes the learning process possible; the interdependence ensures that
statistical information relating to the value of θ is available in
the learning observations.
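
The point of this footnote can be checked with a small discrete example. The sketch below is an illustration only: it assumes two coin tosses that are independent given the bias θ, with θ drawn from a hypothetical two-point prior, and compares the joint probability of two heads with the product of the marginals.

```python
def joint_and_product(prior):
    """prior: dict mapping a coin bias theta to its prior probability.

    Returns (P(heads, heads), P(heads) * P(heads)) for two tosses that are
    independent given theta."""
    joint = sum(t * t * p for t, p in prior.items())
    marginal = sum(t * p for t, p in prior.items())
    return joint, marginal * marginal

# theta unknown: the two tosses are dependent (roughly 0.34 versus 0.25).
print(joint_and_product({0.2: 0.5, 0.8: 0.5}))
# theta known (all prior mass on one value): the two quantities agree.
print(joint_and_product({0.8: 1.0}))
```

In the first case a head on one toss raises the probability that θ is large, and therefore raises the probability of a head on the other toss; this is the interdependence that makes learning possible.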
wherein:

p(θ|Λ₁, ... Λₙ) = a posteriori probability density of θ,
    evaluated in the light of the learning
    observations Λ₁, ... Λₙ

p(Λₙ|θ) = likelihood function on θ given by the nth
    set of learning observations

p(θ|Λ₁, ... Λₙ₋₁) = probability density of θ, evaluated in the
    light of Λ₁, ... Λₙ₋₁
  = a posteriori probability density after n-1
    sets of learning observations
  = a priori probability density just prior to
    taking the nth set of learning observations.

Expanding Eq. (3) to include the improved density calculated from
Eq. (6) results in

$$E[Z \mid \Lambda_1, \ldots, \Lambda_n] = \int E[Z \mid \theta]\, p(\theta \mid \Lambda_1, \ldots, \Lambda_n)\, d\theta \tag{7}$$

wherein E[Z|θ] is assumed independent of Λ₁, ... Λₙ.


D.  IMPLEMENTATION OF LEARNING MODEL

The learning process indicated by Eq. (7) can be implemented as
shown in Fig. 1.* The process is reiterative, with the same computations
performed after obtaining each set of learning observations, but
with the probability density on θ updated each time it is used in the
computation.


FIG. 1. MODEL FOR LEARNING PROCESS.


* For a model applicable in the more general case, where the
conditional-independence assumption is not involved, see Chapter V,
Section F.
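
The reiterative computation described in this section can be sketched as a short loop. This is a hypothetical rendering, not the contents of Fig. 1: it assumes a scalar θ on a finite grid, and the name `learning_process` and the single-trial observation sets are chosen only for illustration.

```python
def learning_process(grid, prior, cond_exp, likelihoods):
    """Yield E[Z], E[Z|L1], E[Z|L1,L2], ... as in sequence (4c).

    grid        -- candidate values of theta
    prior       -- probability mass on each grid point (sums to 1)
    cond_exp    -- function theta -> E[Z|theta]
    likelihoods -- iterable of functions theta -> p(Ln|theta), one per set
    """
    density = list(prior)
    yield sum(cond_exp(t) * w for t, w in zip(grid, density))
    for lik in likelihoods:
        unnorm = [lik(t) * w for t, w in zip(grid, density)]  # numerator of Eq. (6)
        z = sum(unnorm)                                       # denominator of Eq. (6)
        density = [u / z for u in unnorm]   # posterior becomes the next prior
        yield sum(cond_exp(t) * w for t, w in zip(grid, density))  # Eq. (7)

# Hypothetical run: theta is a success probability and each set of learning
# observations is a single trial (1 = success, 0 = failure).
n = 500
grid = [(k + 0.5) / n for k in range(n)]
prior = [1.0 / n] * n
outcomes = [1, 1, 0, 1]
liks = [(lambda t, x=x: t if x else 1 - t) for x in outcomes]
for stage, e in enumerate(learning_process(grid, prior, lambda t: t, liks)):
    print(stage, round(e, 3))
```

For this run the printed values should be close to 1/2, 2/3, 3/4, 3/5 and 2/3. Only the current density on θ is carried from one stage to the next, which is what the conditional-independence assumption makes possible.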
The special case covered by Eqs. (6) and (7) and Fig. 1, though
subject to limitations because of the assumption of the conditional
independence of the learning observations Λ₁, ... Λₙ, is an important
one; in fact, it is the case of primary interest in this investigation.
Many of the results of the study are valid for more general cases, however;
hence, in the development of the theory of the learning process the
possibility of more general results is indicated.

E.  DISCUSSION  OF  LEARNING  MODEL 

The  learning  model  proposed  herein  is  only  one  of  many  possible 
models.  Before  it  is  analyzed  in  detail  some  of  the  implications  of 
the  model  should  be  discussed. 

In proposing the model a Bayesian approach to the learning problem
is used. This approach is often criticized as relying too much on subjective
information, especially in the choice of a priori probability
distributions. A priori information is seldom exact, so that the
a priori probability distributions are normally fairly arbitrary.* On
the other hand, Bayesian methods usually allow the use of all available
a priori information, even if some subjective elements are involved.
Such methods are often applied in cases where the information available
is subjective; yet these methods have been found to give reasonable
results. A detailed discussion of the implications of the Bayesian
approach is given by Savage [Ref. 6].

* One of the most important results of the work reported here is to
indicate reasonable methods for choosing a priori probability functions.
The methods, though rational, do not remove the subjective
element from the a priori judgment, however.

The model analyzed in the present investigation can also be considered
to be a decision-theory model. The methods of statistical decision
theory (a theory that has been developed largely on Bayesian
lines) normally involve assuming a priori probability distributions,
performing experiments to obtain additional information, then making the
type of computations indicated in Fig. 1.

If the model illustrated in Fig. 1 is considered as a model of a
statistical-decision-theory computation, the techniques of decision
theory can be used to optimize the performance of a physical or other
system under consideration. At least, the performance will be optimum
if the correct assumptions are made in the analysis. Since, as noted
above, some of these assumptions are almost always subjective, the form
of the "optimum" system found by one person may differ from that obtained
by another. It can be said that, if the assumptions made by a particular
investigator for the analysis are the best that his knowledge allows him
to make, then the system performance is, to the extent of his knowledge,
optimum; but claims stronger than this are not defensible.* As the
number of learning observations increases, however, the subjective elements
become relatively unimportant, since the a posteriori probability
distributions** become largely independent of the a priori
distributions [Ref. 5].

* This interpretation is similar to the "personalistic" interpretation
of probability theory advocated by Savage [Ref. 6].

** I.e., the probability distributions obtained at the end of the
entire sequence of computations.

A characteristic of the Bayesian approach that distinguishes it
from most other approaches to the learning problem is the fact that no
specific value of the unknown parameter θ is selected at any one time.
Rather, a probability distribution p(θ) over the possible values of
θ is always considered, and the expectation of the performance criterion
is computed based on this distribution [see Eqs. (1), (3), and (7)].
Another approach to the problem would be to estimate a specific value of
θ in some way, then to use the estimate as if it were the true value
of θ. The two approaches are normally equivalent in the limit as the
number of learning observations increases without limit. The common
estimates of θ (for example, maximum-likelihood estimates or Bayes
estimates) converge in the limit to the true value of the parameter,
this convergence taking place with probability one. Similarly, it will
be shown that the probability density function p(θ|Λ₁, ... Λₙ) obtained
in the learning-process model developed in this paper converges with
probability one to a delta function at the true value of θ. Except
for this limiting case, however, specific values of θ are not selected,
although the probability densities discussed in connection with the
learning model would probably be useful in arriving at a specific
estimate of θ.
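
The difference between carrying a distribution on θ and substituting a point estimate can be seen in a small closed-form example. The sketch below is an assumption-laden illustration: θ is a success probability with a uniform prior (so the posterior after s successes in n trials is the standard Beta(s+1, n-s+1) density), and the performance criterion is taken, hypothetically, to satisfy E[Z|θ] = θ(1-θ).

```python
def bayes_expectation(successes, trials):
    """E[theta(1-theta) | data] under a Beta(s+1, n-s+1) posterior."""
    a, b = successes + 1, trials - successes + 1
    return a * b / ((a + b) * (a + b + 1))

def plug_in(successes, trials):
    """theta_hat(1-theta_hat) with the maximum-likelihood estimate s/n."""
    est = successes / trials
    return est * (1 - est)

for n in (10, 100, 10_000):
    s = (3 * n) // 10  # 30 percent successes at every sample size
    print(n, bayes_expectation(s, n), plug_in(s, n))
```

The two columns differ noticeably for small n and approach each other as n grows, in line with the statement that the approaches are normally equivalent in the limit.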

The significance of the use of a probability distribution p(θ)
over the possible values of θ deserves some comment. A number of
interpretations of the significance of this distribution are possible.
For example, θ could be considered to be chosen from an ensemble of
possible values according to the probability density p(θ); or the
assumption might be made that the uncertainty about θ is caused by
some noise (i.e., irrelevant interference) in the selection process.
Or, without any explanation at all, it may simply be considered that
θ is a random variable representing available knowledge of the unknown
parameter. The result of the procedure is probably more important than
its justification. The essential point, no matter how interpreted, is
that the parameter θ is basically to be treated as a random variable.
III.  THE LEARNING PROCESS AND PROBABILITY DISTRIBUTIONS


A.  EARLY  STUDIES  OF  THE  LEARNING  PROCESS 

Earlier investigators have analyzed a number of examples and special
cases of the learning process [Refs. 7-17]. Some of these earlier investigations
furnished the impetus for developing the more general
learning model proposed in the present paper. Examples of special
interest are those that fall within the special case covered by Eq. (7)
and Fig. 1, wherein the learning observations are assumed conditionally
independent given θ. Important examples of the learning process involve
the application of learning techniques to the pattern-recognition problem.
The analysis of the pattern-recognition problem, per se, is only
of peripheral interest at this point, but the problem does present an
interesting challenge to the learning technique. Therefore, enough of
the theory of the pattern-recognition problem will be developed to show
that the learning model illustrated in Fig. 1 is applicable (with minor,
theoretically insignificant, modifications).

B.  THE  PATTERN  RECOGNITION  PROBLEM 

It is assumed that there exist r possible patterns, designated by
the indices 1, 2, ... r, and that it is desired to classify an observation
X as representing one of these patterns. The criterion of excellence
Z is taken as the fraction of the patterns identified correctly.
Thus, E[Z] is the probability of correct identification.*

* This is the criterion obtained with a statistical-decision-theory
approach and a zero-one loss function (i.e., zero loss for a correct
decision, loss of one unit for an error).

Clearly, E[Z] can be maximized by maximizing its value for any
given observation. That is, for any given X, the conditional expectation
E[Z|X] is to be maximized. But E[Z|X] is the conditional probability
of correct identification given the observation X and hence is
maximized if the pattern with highest probability of being correct is
chosen. Putting these requirements together, it is found that the
optimum strategy, or the strategy with maximum probability of correct
identification, is to pick the pattern for which the conditional probability
P(i|X) is maximum. This strategy can be implemented by computing
P(i|X) for each i, or pattern class, then feeding the results
of these computations into a comparator that selects the class for which
P(i|X) is maximum. This leads to the implementation shown in Fig. 2.

A few modifications of the procedure indicated in Fig. 2 are normally
made in implementing such a system. Expanding P(i|X) by Bayes' rule:

$$P(i \mid X) = \frac{p(X \mid i)\, P(i)}{p(X)} \tag{8}$$

where:

P(i|X) = a posteriori probability of the ith pattern class given
    the observation X [this function is assumed to be
    evaluated in the light of X by Eq. (8)]

p(X|i) = conditional probability density of the observation X
    given that the ith pattern is being observed (this
    density is assumed known as a function of X for any
    pattern class; at least, in the conventional pattern
    recognition problem being discussed at this point it is
    known)

P(i) = a priori probability of the ith pattern class
    (this probability is also assumed known for each pattern
    class in the conventional problem)

p(X) = unconditional probability density of the observation X
    (the availability of this density is unimportant as the
    discussion below shows that it is not actually needed).


FIG. 2. PATTERN-RECOGNITION SYSTEM.

Since p(X) does not depend on i it can be discarded as a variable
and attention can be focused on maximizing p(X|i)P(i). It is further
assumed (for simplicity) that all P(i)'s are known and equal, so that
all that remains is merely to maximize p(X|i).
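
The decision rule just described amounts to an arg-max over the classes. The sketch below assumes known scalar Gaussian class densities purely for illustration; `gaussian_pdf` and `classify` are hypothetical names, and classes are indexed from 0 rather than from 1 as in the text.

```python
from math import exp, pi, sqrt

def gaussian_pdf(x, mean, var):
    """Scalar Gaussian density with the given mean and variance."""
    return exp(-(x - mean) ** 2 / (2 * var)) / sqrt(2 * pi * var)

def classify(x, class_densities, priors=None):
    """Return the index i maximizing p(x|i) P(i); equal priors if none given.

    p(x) is never computed, since it does not depend on i."""
    k = len(class_densities)
    priors = priors or [1.0 / k] * k
    scores = [dens(x) * p for dens, p in zip(class_densities, priors)]
    return max(range(k), key=scores.__getitem__)

# Hypothetical two-class problem with known class densities.
densities = [lambda x: gaussian_pdf(x, 0.0, 1.0),
             lambda x: gaussian_pdf(x, 3.0, 1.0)]
print(classify(1.0, densities))                # 0
print(classify(1.0, densities, [0.01, 0.99]))  # 1
```

With equal priors the observation 1.0 is assigned to the class whose mean is nearer; a strongly unequal prior can reverse the decision, which is why the equal-prior assumption is stated explicitly in the text.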

The earlier work on the pattern-recognition problem [Refs. 7-10] has
been based on the computation of p(X|i) when some parameter θᵢ in
this probability-density function is unknown. The basic equations are
slight modifications of Eqs. (1) and (7):*

$$p(X \mid i) = \int p(X \mid i, \theta_i)\, p(\theta_i)\, d\theta_i \tag{9}$$

$$p(X \mid i, \Lambda_{i1}, \ldots, \Lambda_{in}) = \int p(X \mid i, \theta_i)\, p(\theta_i \mid \Lambda_{i1}, \ldots, \Lambda_{in})\, d\theta_i \tag{10}$$

The Λᵢⱼ are assumed to be sets of learning observations from the
ith pattern class.

Since the procedure for all pattern classes is identical, the
subscripts i are now dropped to simplify notation.

* It would be simple to make the correspondence between Eqs. (1) and
(9) and between Eqs. (7) and (10) more exact by defining random variables
with expectations p(X|i) and p(X|i, Λᵢ₁, ... Λᵢₙ).


C.  OTHER EXAMPLES OF THE LEARNING PROCESS

Abramson and Braverman [Refs. 7-9] have been primarily concerned
with the case where p(X) is known to be Gaussian, p(X) ~ N(M, K),**
with the covariance matrix K known but the mean vector M unknown.
In other words, Abramson and Braverman's unknown parameter θ is the
mean vector M of a Gaussian density. They assume a Gaussian a
priori density for M, p(M) ~ N(μ₀, Φ₀), and obtain an a posteriori
density, p(M|Λ₁), which is also Gaussian, p(M|Λ₁) ~ N(μ₁, Φ₁),
with μ₁ and Φ₁ easily computed from μ₀, Φ₀, and Λ₁. The densities
for X, both a priori and a posteriori, are also Gaussian,
p(X) ~ N(μ₀, Φ₀ + K) and p(X|Λ₁) ~ N(μ₁, Φ₁ + K).

** Symbols that represent matrices (including vectors) are in boldface
type. When a symbol is used to represent a variable that could be either
a real number or a vector or matrix (for example, the general parameter
θ), ordinary type is used, however. The notation p(X) ~ N(M, K)
may be read, "The probability density of the vector X (actually the
joint density of the components of X) is normally distributed (or
Gaussian) with mean vector M and covariance matrix K."

The second stage in the learning process under study illustrates
why this particular process is feasible. Since p(M|Λ₁) is of the
same form as p(M) (i.e., Gaussian), and the second stage involves
the same computations as the first stage with p(M|Λ₁) substituted
for p(M), Gaussian probability densities are again obtained for
M and X. By induction it is seen that this will happen after
each set of learning observations. Hence, the form of the learning
system remains fixed as more learning observations are taken.

After each set of learning observations Λₙ, the new mean μₙ
for the density on M is computed as a weighted average of μₙ₋₁
and the average of the observations in Λₙ. In the limit, as the
number of learning observations approaches infinity, μₙ approaches
the average of all the learning observations. It is known, from the
strong law of large numbers [Ref. 18], that with probability one the
average of the observations approaches the true value M of the
mean. At the same time, the elements of the covariance matrix Φₙ
approach zero. Thus, the limiting form of p(M|Λ₁, ... Λₙ) is
N(M, 0). Comparing this with the multivariate Dirac delta function,
it is found that the limiting form of the a posteriori density on M
is a Dirac delta function at the true value of the mean.

If this delta function is put into the equation for
$p(X \mid \lambda_1, \ldots, \lambda_n)$, it is found that the density
approaches the form for known parameters. Hence, the entire system
converges to the form it would take if the parameters were known.
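The fixed form of the computation can be illustrated for the scalar case.
The following is a minimal sketch, not taken from Abramson and Braverman:
it assumes a one-dimensional Gaussian with known variance k, uses the
standard conjugate update for a Gaussian mean, and all function names and
numerical values are hypothetical.

```python
def update_gaussian_mean(mu, phi, observations, k):
    """One learning stage for a scalar Gaussian mean with known variance k.

    (mu, phi) are the mean and variance of the current Gaussian density on M.
    Returns the mean and variance of the a posteriori Gaussian density.
    """
    n = len(observations)
    sample_avg = sum(observations) / n
    # Precisions (inverse variances) add under Bayes' rule.
    post_precision = 1.0 / phi + n / k
    new_phi = 1.0 / post_precision
    # The new mean is a precision-weighted average of the old mean
    # and the average of the new observations.
    new_mu = new_phi * (mu / phi + n * sample_avg / k)
    return new_mu, new_phi


def predictive(mu, phi, k):
    """Density on X implied by N(mu, phi) on M: Gaussian N(mu, phi + k)."""
    return mu, phi + k


# Hypothetical prior N(0, 4), noise variance 1, and three sets of observations.
mu, phi, k = 0.0, 4.0, 1.0
batches = [[2.1, 1.9], [2.0, 2.2, 1.8], [2.0] * 50]
history = []
for batch in batches:
    # The same two-parameter computation is repeated at every stage.
    mu, phi = update_gaussian_mean(mu, phi, batch, k)
    history.append((mu, phi))
```

In this sketch the variance on M shrinks at every stage while the mean
moves toward the average of the observations, so the density returned by
`predictive` tends toward the form for a known mean.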

The solution for the problem of learning the unknown mean was obtained
in a fairly simple manner. The assumption of a Gaussian a priori
probability density on M is the obvious assumption to make since
M is a parameter in a Gaussian density. This assumption gives
Gaussian a posteriori densities on M, and insures that all the
densities required are Gaussian.

Keehn  [Ref.  10]  has  analyzed  a  similar  problem  and  obtained  similar 
results.  For  his  problem  the  assumptions  that  keep  the  form  of  the 
learning  system  fixed  are  less  obvious,  however.  Keehn  has  analyzed  the 
problem  of  learning  the  covariance  matrix  K  for  a  Gaussian  density 
when  the  mean  vector  M  is  known. 

The key assumption necessary to solve the unknown covariance problem
is the assumption of a Wishart a priori density over the elements of
the inverse covariance matrix $K^{-1}$.* The a posteriori density on the
elements of $K^{-1}$ is also Wishart, with new parameters calculated from
the old parameters and the learning observations. The limiting form of
the a posteriori density is again a delta function at the true values
of the unknown parameters, in this case the true values of the
components of the inverse covariance matrix.

The  probability  density  for  X  turns  out,  in  this  case,  to  be  a 
Student  density  instead  of  the  Gaussian  density  one  might  expect.  As 
the  number  of  learning  observations  approaches  infinity,  however,  the 
limiting  form  of  the  Student  density  becomes  Gaussian  with  the  true  mean 
vector and covariance matrix. Hence, the limiting form of the a
posteriori density on X is as desired.
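A scalar analogue may make the preceding description concrete. The sketch
below is not Keehn's derivation: it assumes a one-dimensional Gaussian
with known mean and places a gamma density on the inverse variance (the
gamma density being the one-dimensional special case of the Wishart
density). The parameter names `a` (shape) and `b` (rate), the seed, and
all numerical values are hypothetical.

```python
import random


def update_precision(a, b, observations, mean):
    """Update a Gamma(shape a, rate b) density on the inverse variance 1/k
    of a scalar Gaussian whose mean is known."""
    n = len(observations)
    sum_sq = sum((x - mean) ** 2 for x in observations)
    # The a posteriori density is again gamma, with new parameters.
    return a + n / 2.0, b + sum_sq / 2.0


random.seed(0)
true_mean, true_var = 1.0, 4.0
a, b = 1.0, 1.0                      # hypothetical a priori parameters
for _ in range(50):                  # fifty sets of 100 observations each
    batch = [random.gauss(true_mean, true_var ** 0.5) for _ in range(100)]
    a, b = update_precision(a, b, batch, true_mean)

# Mean of the a posteriori gamma density on the inverse variance.
precision_estimate = a / b
# The density on X is Student with 2a degrees of freedom; as a grows
# the Student density approaches a Gaussian density.
predictive_dof = 2.0 * a
```

With these assumed values the estimate settles near the true inverse
variance 1/4, and the number of degrees of freedom of the Student density
grows with the number of observations.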

Keehn has analyzed in a similar manner the case where both K and
M are unknown. He obtained analogous results by assuming a composite
Wishart-Gaussian density on the elements of $K^{-1}$ and M.** The a
posteriori density is also of this composite form and converges to a delta
function at the true values of the unknown parameters. The density on
X is a modified form of the Student density, which approaches the
true Gaussian density.


* The form of this density is given in Chapter VI, Table 2, Case u.
** The form of this density is given in the Appendix, Eq. (A-7).
D. FEASIBILITY OF THE LEARNING PROCESS AS DETERMINED BY PROBABILITY DISTRIBUTIONS

The examples cited above illustrate one method of guaranteeing that
the learning process is feasible. If it is possible to pick an a priori
density $p(\theta)$ for $\theta$ such that the a posteriori density
$p(\theta \mid \lambda_1, \ldots, \lambda_n)$ is of the same form (e.g.,
both Gaussian or both Wishart), then the Bayes' rule computer merely
computes new values for the parameters describing the density on $\theta$
in terms of the old values and the learning observations. If the form of
the density is preserved after one set of learning observations, the
arguments used for the Gaussian case show that it is preserved no matter
how many learning observations are taken. Hence, the learning process is
feasible in the sense under consideration--i.e., in the sense that a fixed
form of computations is applicable throughout the entire process.

The learning process is considered to be feasible if the computations
necessary after taking learning observations are fixed, neither the
number nor the forms of the computations changing. This requirement of
a fixed set of computations is imposed from the point of view of
engineering feasibility. If the system can learn by performing a fixed set
of computations after each observation period, the engineering problems in
designing an actual system may be soluble; if the system has to be
reprogrammed periodically, or if the number of computations necessary grows
without bound, the design problems almost certainly are not soluble.

1.  Reproducing-Type  Distributions 

In the present investigation, probability distributions that
preserve their form under Bayes' rule, i.e., for which the a priori
and a posteriori distributions have the same form, will be designated
as "reproducing-type distributions." Besides the investigators mentioned
above, a number of other persons have utilized distributions of this
type. Bellman [Ref. 11] has utilized a beta density for learning the
parameter characterizing a binomial distribution; Mosimann [Ref. 12]
has utilized the "multivariate beta" or Dirichlet distribution for the
parameters of a multinomial distribution; Turin [Ref. 13] has used the
"generalized Rayleigh" or Rician density for learning the amplitude
and phase characteristics of a radio channel; and Kailath [Ref. 14] has
utilized a Gaussian distribution for learning the unknown mean of a
Gaussian distribution, obtaining results similar to those of Abramson
and Braverman in a different manner. None of these workers give methods
for finding reproducing-type distributions, however. The only general
method of finding reproducing-type distributions that has been found in
the literature is that of Raiffa and Schlaifer [Ref. 15]. These authors
discuss an important class of reproducing-type distributions--a class
that includes all the reproducing distributions mentioned above save the
Rician distribution utilized by Turin.*
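As an illustration of a reproducing-type distribution, the sketch below
applies the standard Dirichlet update for multinomial counts; the beta
density for a binomial parameter is the two-category special case. This is
a minimal example under assumed values, not a reconstruction of the
methods of the cited authors, and the function names and counts are
hypothetical.

```python
def update_dirichlet(alpha, counts):
    """Bayes' rule for a Dirichlet density on multinomial probabilities:
    each parameter is increased by the number of times its category occurred."""
    return [a + c for a, c in zip(alpha, counts)]


def dirichlet_mean(alpha):
    """Mean of the Dirichlet density, used here as a point estimate."""
    total = sum(alpha)
    return [a / total for a in alpha]


alpha = [1.0, 1.0, 1.0]              # hypothetical uniform a priori density
for counts in ([3, 1, 0], [10, 6, 4], [52, 29, 19]):
    # Every stage is the same computation on a fixed number of parameters.
    alpha = update_dirichlet(alpha, counts)

estimate = dirichlet_mean(alpha)
```

However many sets of counts are processed, the density on the three
probabilities is described by the same three parameters.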

2.  Nonreproducing  Distributions 

Lest the reader gain the impression that reproducing-type distributions
always exist, so that the problem is merely one of finding the
appropriate reproducing distribution, attention is called to one example
of a case where no reproducing distributions exist. This example is
taken from a problem studied by Daly [Refs. 16 and 17], which is similar
to the problems studied by Abramson, Braverman, et al. The chief
difference between Daly's problem and the cases hitherto mentioned lies
in the form of the information given to the learning system during the
learning process. An important assumption in the analysis of the
examples previously considered has been the assumption that the learning
observations were classified--i.e., the system was told to which pattern
each learning observation corresponded. This assumption made it possible
to state that the $\lambda_{ij}$ in Eq. (10) consisted of samples from the
i-th pattern class.** Daly assumed that the system was not given this
information, either during the learning process or during the recognition
process. The two problems may be distinguished by calling the
former the "perfect-teacher" problem and the latter the "no-teacher"
problem.

* The forms of all these densities, including that used by Turin, are
derived in Chapter VI and in the Appendix.

** In a typical application of this theory the system would be given
a set of classified patterns during a training period, then would be
told to identify unclassified patterns later. In a few cases the correct
classification of patterns might be available with a slight delay, with
a decision needed earlier. The same techniques could be used as in the
first case, but with the added possibility of indefinitely continuing
the training period.

A simple example of the "no-teacher" problem would allow for two
alternative hypotheses: either (1) both noise and a one-dimensional
signal of unknown magnitude m are present; or (2) the noise alone is
present. Assuming Gaussian noise distribution with zero mean and
variance $\sigma^2$, and assuming also that the two hypotheses are equally
probable, the conditional probability density of an observation X
given m is:

$$p(X \mid m) = \frac{1}{2} \cdot \frac{1}{\sqrt{2\pi}\,\sigma}
\left\{ \exp\left[-\frac{(X-m)^2}{2\sigma^2}\right]
+ \exp\left[-\frac{X^2}{2\sigma^2}\right] \right\} \tag{11}$$

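If an a priori probability density $p(m)$ is assumed and if a
set $X_1, \ldots, X_n$ of measurements chosen according to the density
given by Eq. (11) are used as learning observations, it is found that

$$p(m \mid X_1, \ldots, X_n)
= \frac{p(X_1, \ldots, X_n \mid m)\, p(m)}
       {\int p(X_1, \ldots, X_n \mid m)\, p(m)\, dm}
= \frac{\prod_{i=1}^{n} \left\{ \exp\left[-(X_i-m)^2/2\sigma^2\right]
        + \exp\left[-X_i^2/2\sigma^2\right] \right\} p(m)}
       {\int \prod_{i=1}^{n} \left\{ \exp\left[-(X_i-m)^2/2\sigma^2\right]
        + \exp\left[-X_i^2/2\sigma^2\right] \right\} p(m)\, dm} \tag{12}$$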


In each of the earlier examples the a posteriori density
$p(\theta \mid \lambda_1, \ldots, \lambda_n)$ was expressible in terms of a
fixed number of parameters no matter how many learning observations were
taken. Thus, the form of the density did not change as the learning
observations progressed. In the case of learning a Gaussian mean M, only
two parameters, $\mu$ and $\Phi$, were necessary. Since the Wishart density
is expressed in terms of a fixed set of parameters, a similar situation was
true for learning the covariance matrix K or for learning both M and
K. This is not the case with the density in Eq. (12), however. In
fact, no nondegenerate form for $p(m)$ has been found that allows
$p(m \mid X_1, \ldots, X_n)$ to be expressed in terms of fewer than n
parameters (one for each $X_i$). It is shown in Chapter VI, Section D, that
expression in terms of fewer than n parameters is impossible with any
nondegenerate $p(m)$; hence, the form of the density keeps changing as long
as the learning observations are continued.
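The growth in complexity can be seen numerically. The sketch below
evaluates the a posteriori density of Eq. (12) on a grid and counts the
terms produced when the product of n two-term factors is multiplied out.
It is only an illustration: the flat a priori density over the grid, the
grid limits, the observations, and the function names are assumptions, and
the rectangle rule stands in for the integral.

```python
import math


def likelihood(x, m, sigma):
    """Eq. (11): equal mixture of N(m, sigma^2) and N(0, sigma^2)."""
    norm = 1.0 / (math.sqrt(2.0 * math.pi) * sigma)
    signal = math.exp(-(x - m) ** 2 / (2.0 * sigma ** 2))
    noise = math.exp(-x ** 2 / (2.0 * sigma ** 2))
    return 0.5 * norm * (signal + noise)


def posterior_on_grid(xs, grid, prior, sigma):
    """Eq. (12) evaluated on an evenly spaced grid of values of m."""
    weights = []
    for m in grid:
        w = prior(m)
        for x in xs:                 # every observation must be retained
            w *= likelihood(x, m, sigma)
        weights.append(w)
    step = grid[1] - grid[0]
    total = sum(weights) * step      # rectangle-rule normalization
    return [w / total for w in weights]


def expanded_term_count(n):
    """Number of terms when n two-term factors are multiplied out."""
    return 2 ** n


xs = [2.9, 0.1, 3.2, -0.3, 3.1]                  # hypothetical observations
grid = [-2.0 + 0.01 * i for i in range(801)]     # m from -2 to 6
post = posterior_on_grid(xs, grid, lambda m: 1.0, 1.0)
mode = grid[max(range(len(post)), key=post.__getitem__)]
```

Unlike the Gaussian and Wishart cases, no short list of parameters is
updated here; the function needs the full list `xs` at every stage.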

The example of the "no-teacher" problem clarifies what is meant by
saying that the a priori and a posteriori densities are of the same
form; this requirement must be interpreted to include expression of the
densities in terms of a fixed number of parameters. Otherwise, the
density in Eq. (12) might conceivably be considered to be reproducing,
since the expression in the last part of this equation is always valid.
The example also indicates that it cannot automatically be assumed in
any particular case that reproducing-type densities exist.


E.  PROBLEMS  FOR  FURTHER  INVESTIGATION 

Examples  of  the  learning  process  studied  in  this  chapter  have 
described  three  main  problems: 

1.  To  find  general  conditions  under  which  the  a  posteriori 
probability  density  approaches  a  delta  function  at  the  true 
value  of  the  unknown  parameter. 

2.  To  find  conditions  guaranteeing  the  existence  of  reproducing- 
type  probability  distributions. 

3.  To find the forms of any reproducing-type probability
distributions that may exist in a particular case.

These  problems  are  investigated  in  the  following  chapters. 
IV. CONDITIONS UNDER WHICH THE A POSTERIORI DISTRIBUTION APPROACHES A DELTA FUNCTION


This  chapter  considers  the  first  problem  posed  at  the  end  of 
Chapter  III:  to  find  general  conditions  under  which  the  a  posteriori 
probability  distribution  approaches  a  delta  function  at  the  true  value 
of  the  unknown  parameter. 

A.  THE  CONVERGENCE  THEOREM 

In each of the examples of learning processes discussed in Chapter III
the limiting form of the a posteriori density
$p(\theta \mid \lambda_1, \ldots, \lambda_n)$ as n increases is a delta
function at the true value of $\theta$. The conditions needed to insure
that this is so are simple: it must be possible to calculate the true
value of $\theta$ from an infinite sequence of observations, and this true
value must not be ruled out by $p(\theta)$, the a priori probability
distribution on $\theta$. More rigorously:

Theorem I. Assume that the following conditions are satisfied:

1.  $\theta_0$ is the true value of $\theta$.

2.  The a priori density $p(\theta) > 0$ in some sphere containing
$\theta_0$.

3.  The a posteriori densities $p(\theta \mid \lambda_1, \ldots, \lambda_n)$
are calculated by Bayes' rule.

4.  There exists a sequence of functions $f_n(\lambda_1, \ldots, \lambda_n)$
converging to $\theta_0$ with probability one.

Then $p(\theta \mid \lambda_1, \ldots, \lambda_n) \to \delta(\theta - \theta_0)$
with probability one, where $\delta(\theta - \theta_0)$ is a Dirac delta
function (of the same dimension as $\theta$).

Proof: Theorem I is an immediate consequence of the zero-one law
of probability theory as stated by Loève [Ref. 18, p. 398]. The statement
of this law used here is, "The sequence $P(B \mid Y_1, \ldots, Y_n)$ of
conditional probabilities of a property B of the sequence $Y_1, Y_2, \ldots$
given the first n terms of the sequence converges almost surely to
1 or 0 according as the sequence has or has not this property." If
B is a sphere in the range of $\theta$, then the event that*

$$\lim_{n \to \infty} f_n(\lambda_1, \ldots, \lambda_n) \in B$$

is an event defined on the $\lambda_i$ and hence satisfies Loève's
definition of a "property" of the sequence. Therefore,

$$P(B \mid \lambda_1, \ldots, \lambda_n)
= \int_B p(\theta \mid \lambda_1, \ldots, \lambda_n)\, d\theta
\to 1 \text{ or } 0 \tag{13}$$

according as $\theta_0$ is or is not in B. Equation (13) is equivalent to
the statement that $p(\theta \mid \lambda_1, \ldots, \lambda_n)$ converges to
$\delta(\theta - \theta_0)$.**

Since Theorem I and its proof are fairly abstract, the significance
of the assumptions should be pointed out. Assumption (4) guarantees
that the event that $\lim f_n \in B$ is a property of the sequence.
Assumption (1) guarantees that this event is true, or that the sequence
has the desired property. Assumption (3) guarantees that the correct forms
are used for the a posteriori probabilities, since these probabilities
are calculated by the standard methods of probability theory. The
other assumption, number (2), is hidden in Loève's statement of the
zero-one law. In all of the material he treats, Loève assumes the events
considered have positive probability. Assumption (2) insures that
this is true.

From the definition of the Dirac delta function and Eq. (3) there
is derived the important

Corollary: If the assumptions in Theorem I are satisfied,
$E[Z \mid \lambda_1, \ldots, \lambda_n] \to E[Z \mid \theta_0]$ with
probability one, where Z is a random variable representing a selected
performance criterion.

* The symbol $\in$ in this equation should be read "is in" or "belongs
to."

** Theorem I is based on Theorem 5.1 of Braverman [Ref. 7, p. 29].
The material just presented comprises a more precise statement of the
theorem and simplifies the proof. The proof is still quite abstract,
however, despite its deceptively simple appearance. Those readers
unable to follow the proof completely may treat it as a plausibility
argument.
This corollary indicates that the entire system approaches the form
it would take if $\theta_0$ were known to be the true value of $\theta$.

B.  DISCUSSION  OF  THEOREM 

Theorem I is more general in its import than may at first be
apparent. No statements have been made as to whether a "teacher" is
present or not. It has not been required that any type of independence
hold, nor does Loève require independence for his theorem. It is
merely required that the sequence of functions
$f_n(\lambda_1, \ldots, \lambda_n)$ exist. Such a sequence can exist either
with or without a teacher, either with or without independence.

The requirement that this sequence of functions exist is simply a
method of saying that the true value of $\theta$ must, with probability
one, be determinable from an infinite sequence of learning observations.
If it be assumed that the sets of learning observations consist of single
observations, i.e., $\lambda_i = \{X_i\}$, and that the $X_i$ are
conditionally independent given $\theta$ (the same independence assumption
used in Chapter II), this requirement can be put into a more easily
visualized form. In this case if a function f of a single observation,
such that

$$E[f(X_i) \mid \theta] = \theta, \tag{14}$$

exists, then by the strong law of large numbers,

$$\frac{1}{n} \sum_{i=1}^{n} f(X_i) \to \theta_0 \tag{15}$$

with probability one, where $\theta_0$ is the true value of $\theta$.*

* In applying the strong law of large numbers to this case, it is
necessary to recall the earlier interpretation of the requirement that
the $X_i$ be independent given $\theta$. In Chapter II, this requirement was
interpreted to mean that if $\theta$ were known the $X_i$ would be
independent. The knowledge available about $\theta$ does not affect the
convergence of Eq. (15); so the strong law is applied as if it were known
that $\theta$ equals $\theta_0$.
As an example, in the case of the unknown mean of a Gaussian
distribution, the sample average

$$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$$

converges  to  the  true  value  of  the  mean  with  probability  one.  Similarly, 
for  the  case  of  an  unknown  covariance  matrix,  the  sample  covariance 
matrix 


$$\frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^t (X_i - \bar{X})$$

converges  to  the  true  covariance  matrix  with  probability  one. 

Theorem I can also be applied to the simple example of the "no-teacher"
problem discussed in Chapter III. For the density given by
Eq. (11),

$$E[X \mid m] = \tfrac{1}{2}\, m. \tag{16}$$

Hence, by Eqs. (14) and (15)

$$\frac{1}{n} \sum_{i=1}^{n} 2 X_i \to m_0 \tag{17}$$

with probability one, where $m_0$ is the true value of m. This result
agrees with Daly's application of limiting arguments [Refs. 16 and 17]
to show that the limiting form of the optimum system is the form it
would take if m were known.

As the conditions of Theorem I are met for most probability distributions
of practical significance, this theorem provides reasonably general
conditions insuring that the limiting form of the a posteriori density
is a delta function at the true value of $\theta$. Thus, Theorem I affords a
solution to the first of the three problems posed at the end of Chapter
III.
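The convergence claimed for the "no-teacher" example can be checked by
simulation. The following sketch draws observations from the mixture of
Eq. (11) and applies the function f(X) = 2X, whose conditional expectation
is m. The true value, the noise level, the seed, and the function names are
hypothetical choices made only for this illustration.

```python
import random


def sample_no_teacher(m, sigma, rng):
    """Draw one observation from Eq. (11): the signal of magnitude m is
    present with probability 1/2, otherwise only zero-mean noise is seen."""
    signal_present = rng.random() < 0.5
    return rng.gauss(m if signal_present else 0.0, sigma)


def estimate_m(observations):
    """Since E[X|m] = m/2, the function f(X) = 2X has expectation m;
    the estimate is the average of f over the observations."""
    return 2.0 * sum(observations) / len(observations)


rng = random.Random(1)
true_m, sigma = 3.0, 1.0
xs = [sample_no_teacher(true_m, sigma, rng) for _ in range(20000)]

# Estimates from progressively longer prefixes of the same sequence.
estimates = [estimate_m(xs[:n]) for n in (100, 1000, 20000)]
```

No observation is labeled as signal or noise in this sketch, yet the
sequence of estimates still approaches the assumed true value, which is
the content of assumption (4) for this problem.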
C. ILLUSTRATION OF CONVERGENCE

An illustration of the manner in which the a posteriori density
approaches a delta function is given by Fig. 3. In this figure are
plotted probability densities for the parameter P characterizing a
binomial distribution. A uniform a priori density over the interval
from 0 to 1 has been assumed, and the a posteriori density
$p(P \mid \lambda_1, \ldots, \lambda_n)$ has been plotted under the
assumption that equal numbers (1, 2, 4, 8 and 16) of occurrences of each of
the two possible events have been observed.* The conclusion from the plot
is that the value of P becomes known more and more accurately as more
observations are taken--this is illustrated by the continuously decreasing
width of the plots in Fig. 3--with the true value of P becoming known
exactly after an infinite number of observations, when the density becomes
a delta function at the true value of P, P = 1/2.

FIG. 3. PROBABILITY DENSITIES FOR THE PARAMETER P CHARACTERIZING A
BINOMIAL DISTRIBUTION.
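The narrowing described above can be quantified with a short calculation.
The sketch below assumes the standard result that a uniform a priori
density on a binomial parameter, followed by k occurrences of each event,
gives a beta a posteriori density with parameters (k + 1, k + 1); the
standard deviation is used here as a measure of width. The function names
are hypothetical.

```python
import math


def beta_posterior(successes, failures, a0=1.0, b0=1.0):
    """Beta parameters after the observations; a0 = b0 = 1 is the
    uniform a priori density on the interval from 0 to 1."""
    return a0 + successes, b0 + failures


def beta_mean_std(a, b):
    """Mean and standard deviation of a beta density with parameters a, b."""
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1.0))
    return mean, math.sqrt(var)


widths = []
for k in (1, 2, 4, 8, 16):           # equal numbers of each event
    a, b = beta_posterior(k, k)
    mean, std = beta_mean_std(a, b)  # mean stays at 1/2 throughout
    widths.append(std)
```

For these symmetric cases the variance reduces to 1/(4(2k + 3)), so the
width decreases steadily toward zero as k grows.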


Since the a priori density $p(P)$ is uniform and none of the a
posteriori densities are uniform (in this case all of the a posteriori
densities are beta), the a priori density in this example is not
reproducing-type. However, since all the a posteriori densities are
of the same form, the densities may be considered to become
reproducing-type after one observation. It is shown in Chapter V,
Section D, that a posteriori densities often become reproducing-type after
a few observations even when the a priori density is not reproducing-type.
V. CONDITIONS FOR REPRODUCING-TYPE PROBABILITY DISTRIBUTIONS


This chapter attacks the second and third problems posed at the end
of Chapter III: namely, the problem of finding conditions guaranteeing
the existence of reproducing-type probability distributions, and also the
problem of finding the forms of any such distributions that may exist.

A reproducing-type probability distribution has been defined as one
in which the a posteriori distribution
$p(\theta \mid \lambda_1, \ldots, \lambda_n)$ has the same form as the
distribution $p(\theta)$ assumed a priori, the two distributions
being related through Bayes' rule applied in the light of a series of
learning observations $\lambda_1, \ldots, \lambda_n$ (Eqs. (2) and (6)). The
first step in the present study, therefore, is to find a convenient method
for analyzing the form of $p(\theta \mid \lambda_1, \ldots, \lambda_n)$ in
any particular case.


A. FACTORIZATION OF A POSTERIORI DENSITY (ASSUMING LEARNING OBSERVATIONS ARE CONDITIONALLY INDEPENDENT GIVEN $\theta$)

A principal difficulty in analyzing the form of the a posteriori
probability density $p(\theta \mid \lambda_1, \ldots, \lambda_n)$ as it is
given by Bayes' rule arises from the arbitrary nature of the a priori
density $p(\theta)$. The only real requirement put on the a priori density
is that it be a true probability density; hence, it must be non-negative
and integrate to one. Since $p(\theta)$ is involved in the computation of
each of the a posteriori densities
$p(\theta \mid \lambda_1, \ldots, \lambda_n)$, this introduces some
arbitrariness into each of these a posteriori densities. This may be
illustrated by writing Bayes' rule in terms of the likelihood of the
complete sequence of sets of learning observations, i.e., in terms of
$p(\lambda_1, \ldots, \lambda_n \mid \theta)$:

$$p(\theta \mid \lambda_1, \ldots, \lambda_n)
= \frac{p(\lambda_1, \ldots, \lambda_n \mid \theta)\, p(\theta)}
       {\int p(\lambda_1, \ldots, \lambda_n \mid \theta)\, p(\theta)\, d\theta}
\tag{18}$$

Fortunately, the expression in Eq. (18) for the a posteriori
density may be factored in a manner that simplifies analysis of its
form.
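Theorem II: Assume the likelihood
$p(\lambda_1, \ldots, \lambda_n \mid \theta)$ is greater than zero and is an
integrable function of $\theta$. Then
$p(\theta \mid \lambda_1, \ldots, \lambda_n)$ can be expressed as

$$p(\theta \mid \lambda_1, \ldots, \lambda_n)
= \hat{p}(\theta \mid \lambda_1, \ldots, \lambda_n)\,
  \frac{p(\theta)}{E[p(\theta) \mid \lambda_1, \ldots, \lambda_n]} \tag{19}$$

where

$$\hat{p}(\theta \mid \lambda_1, \ldots, \lambda_n)
= \frac{p(\lambda_1, \ldots, \lambda_n \mid \theta)}
       {\int p(\lambda_1, \ldots, \lambda_n \mid \theta)\, d\theta} \tag{20}$$

is a probability density on $\theta$ depending only on the observations, and
where $E[p(\theta) \mid \lambda_1, \ldots, \lambda_n]$ is the expectation of
the a priori density $p(\theta)$ taken with respect to the density
$\hat{p}(\theta \mid \lambda_1, \ldots, \lambda_n)$. Further, if
$p(\theta)$ is bounded and $p(\theta_0) > 0$, then

$$p(\theta \mid \lambda_1, \ldots, \lambda_n) \to \delta(\theta - \theta_0)
\tag{21}$$

with probability one if and only if

$$\hat{p}(\theta \mid \lambda_1, \ldots, \lambda_n) \to
\delta(\theta - \theta_0) \tag{22}$$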

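Proof: The function $\hat{p}(\theta \mid \lambda_1, \ldots, \lambda_n)$ is
by its definition in Eq. (20) a legitimate and well-defined probability
density, since it has been assumed that
$p(\lambda_1, \ldots, \lambda_n \mid \theta) > 0$ and is integrable.
Rewriting Eq. (18) in the form

$$p(\theta \mid \lambda_1, \ldots, \lambda_n)
= \frac{\dfrac{p(\lambda_1, \ldots, \lambda_n \mid \theta)}
              {\int p(\lambda_1, \ldots, \lambda_n \mid \theta')\, d\theta'}\,
        p(\theta)}
       {\displaystyle\int
        \dfrac{p(\lambda_1, \ldots, \lambda_n \mid \theta)}
              {\int p(\lambda_1, \ldots, \lambda_n \mid \theta')\, d\theta'}\,
        p(\theta)\, d\theta} \tag{18a}$$

and incorporating the definition of
$\hat{p}(\theta \mid \lambda_1, \ldots, \lambda_n)$ in Eq. (20) it is
seen that $p(\theta \mid \lambda_1, \ldots, \lambda_n)$ may be written in
the form in Eq. (19).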

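To prove the convergence portions of Theorem II, assume
$\hat{p}(\theta \mid \lambda_1, \ldots, \lambda_n) \to \delta(\theta - \theta_0)$
as specified in Eq. (22). Then, since $p(\theta_0) > 0$ by assumption, and
$E[p(\theta) \mid \lambda_1, \ldots, \lambda_n]$ approaches $p(\theta_0)$
as $\hat{p}(\theta \mid \lambda_1, \ldots, \lambda_n) \to
\delta(\theta - \theta_0)$,

$$p(\theta \mid \lambda_1, \ldots, \lambda_n) \to
\delta(\theta - \theta_0)\, \frac{p(\theta)}{p(\theta_0)}
= \delta(\theta - \theta_0) \tag{23}$$

Conversely, if it be assumed that
$p(\theta \mid \lambda_1, \ldots, \lambda_n) \to \delta(\theta - \theta_0)$,
then Eq. (19) indicates that

$$\hat{p}(\theta \mid \lambda_1, \ldots, \lambda_n)
= p(\theta \mid \lambda_1, \ldots, \lambda_n)\,
  \frac{E[p(\theta) \mid \lambda_1, \ldots, \lambda_n]}{p(\theta)}
\to \delta(\theta - \theta_0)\,
  \frac{E[p(\theta) \mid \lambda_1, \ldots, \lambda_n]}{p(\theta)} \tag{24}$$

Since $E[p(\theta) \mid \lambda_1, \ldots, \lambda_n]$ is a constant and
$p(\theta)$ has been assumed to be bounded, Eq. (24) can be valid only if
$\hat{p}(\theta \mid \lambda_1, \ldots, \lambda_n) \to
\delta(\theta - \theta_0)$.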

The density $\hat{p}(\theta \mid \lambda_1, \ldots, \lambda_n)$, which
might be called the "experimental portion" of the a posteriori density, is
simply a normalized version of the likelihood. It is a function of
$\lambda_1, \ldots, \lambda_n$ as well as of $\theta$, but it is here
assumed that the observations have been made and
$\lambda_1, \ldots, \lambda_n$ have been replaced by the results of the
observations. Under these conditions,
$\hat{p}(\theta \mid \lambda_1, \ldots, \lambda_n)$ is a function of the
single variable $\theta$.

The integrability condition on
$p(\lambda_1, \ldots, \lambda_n \mid \theta)$ in Theorem II is
normally fulfilled for large n, as this density tends to become more
and more concentrated near the true value of $\theta$, so that the effective
range of integration is small.* In all cases thus far encountered for
which the techniques of Theorem II are applicable,
$p(\lambda_1, \ldots, \lambda_n \mid \theta)$ becomes integrable after a few
observations (typically one or two) and remains integrable as more
observations are made. Unless otherwise stated, it will henceforth be
assumed that this integrability condition is satisfied.
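The factorization of Theorem II can be checked numerically for a simple
case. The sketch below is only an illustration under assumed inputs: a
binomial likelihood with 7 and 3 occurrences of the two events, a
hypothetical non-uniform a priori density proportional to 1 + theta, and
midpoint sums on a grid standing in for the integrals.

```python
def normalize(values, step):
    """Scale values so that their midpoint-rule integral equals one."""
    total = sum(values) * step
    return [v / total for v in values]


def binomial_likelihood(theta, successes, failures):
    """Likelihood of the observations as a function of theta."""
    return theta ** successes * (1.0 - theta) ** failures


step = 0.001
grid = [(i + 0.5) * step for i in range(1000)]        # midpoints in (0, 1)
prior = normalize([1.0 + t for t in grid], step)      # hypothetical p(theta)
like = [binomial_likelihood(t, 7, 3) for t in grid]

# Eq. (18): Bayes' rule applied directly.
posterior = normalize([l * p for l, p in zip(like, prior)], step)

# Eq. (20): the "experimental portion" is the normalized likelihood.
experimental = normalize(like, step)

# Expectation of the a priori density with respect to the experimental portion.
expected_prior = sum(e * p for e, p in zip(experimental, prior)) * step

# Eq. (19): experimental portion times p(theta) / E[p(theta) | observations].
factored = [e * p / expected_prior for e, p in zip(experimental, prior)]

max_gap = max(abs(u - v) for u, v in zip(posterior, factored))
```

On the grid the two routes to the a posteriori density agree to within
rounding error, so the a priori density enters only through the ratio in
Eq. (19).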


B.  EXPERIMENTAL  PORTION  OF  A  POSTERIORI  DENSITY 

Theorem II indicates that, at least after a large number of learning
observations, the behavior of $p(\theta \mid \lambda_1, \ldots, \lambda_n)$
is primarily determined by the "experimental portion"
$\hat{p}(\theta \mid \lambda_1, \ldots, \lambda_n)$. Also, the latter
density is less arbitrary and consequently easier to work with than is the
basic function. The conditions that must be satisfied for the "experimental
portion" of the a posteriori density to be reproducing are now to be
investigated.

Definition No. 1: The a priori density $p(\theta)$ is said to reproduce
itself with respect to the likelihood $p(\lambda_1 \mid \theta)$ if
$p(\theta)$ and the a posteriori density $p(\theta \mid \lambda_1)$ are
members of the same family of probability densities, differing only in the
values of the parameters characterizing densities in this family.

* Lindley [Ref. 5] has shown that with any reasonably smooth a priori
density, the limiting form of the a posteriori density is independent
of the a priori density, being Gaussian with means at the maximum
likelihood values and with variances decreasing as 1/n. (Another type
of density, possibly a reproducing-type density, may approximate the
a posteriori density slightly more accurately, but both this density and
Lindley's Gaussian density approach each other and the delta function
limit of Theorem I.) A general proof that the effective range of
integration approaches zero is easily deduced from Lindley's result.

The limiting form Lindley obtains is almost identical to the
limiting form for the probability density of a maximum-likelihood
estimate. This latter density can be found in many standard statistics
texts. An alternative approach to proving that the effective range of
integration approaches zero could be based on these maximum-likelihood
analyses.

An illustration of the manner in which the effective range of
integration for $p(\lambda_1, \ldots, \lambda_n \mid \theta)$ approaches
zero may be deduced from Fig. 3. Since in that figure a uniform a priori
density was assumed, the a posteriori density plotted in the figure is
proportional to $p(\lambda_1, \ldots, \lambda_n \mid \theta)$, and the
effective range of integration is the effective width of the plot.

If $p(\theta)$ reproduces itself, the result of the Bayes' rule
computation in the learning process is simply to compute new values for the
parameters characterizing densities in the family, this computation
giving $p(\theta \mid \lambda_1)$. The next stage of the learning process
involves the same computations save for replacing $p(\theta)$ by
$p(\theta \mid \lambda_1)$ and using the set $\lambda_2$ of learning
observations instead of $\lambda_1$. If these sets of learning observations
are of the same type, $p(\theta \mid \lambda_1)$ reproduces itself
with respect to the likelihood $p(\lambda_2 \mid \theta)$ if $p(\theta)$
reproduces itself with respect to $p(\lambda_1 \mid \theta)$. Proceeding by
induction, it is seen that
$p(\theta \mid \lambda_1, \ldots, \lambda_{n-1})$ reproduces itself with
respect to $p(\lambda_n \mid \theta)$ if $p(\theta)$ reproduces itself with
respect to $p(\lambda_1 \mid \theta)$.

Thus, under the assumed set of conditions, the fact that $p(\theta)$
reproduces itself with respect to the likelihood
$p(\lambda_1 \mid \theta)$ guarantees that all the a posteriori densities
are members of the same family of probability densities. At each stage of
the learning process the Bayes' rule computer merely computes new values
for the parameters describing these densities. The remainder of the
computations involved in the learning process, multiplication by
$E[Z \mid \theta]$ and integration, are fixed computations (see Fig. 1) and
can always be accomplished in the same manner. Even if the result of this
computation cannot be obtained analytically in closed form, it can be
obtained by a fixed procedure of numerical integration or by electronic
integration. Hence, if $p(\theta)$ reproduces itself with respect to
$p(\lambda_1 \mid \theta)$, the computations necessary for the entire
learning process are the same at each stage of the process.
It is assured that the system will not have to be reprogrammed in the
middle of the learning process.

Strictly speaking, the sets of learning observations or the
likelihoods $p(\lambda_i \mid \theta)$ should be included in any statement
about densities reproducing themselves. In cases where the meaning is
clear, however, reference will be made to the densities $p(\theta)$ as
being reproducing-type densities, without specific mention of the learning
observations.
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C.  GUFF IC I ENT  STATISTICS 


In actual computations of the a posteriori probabilities, it is
often unnecessary to have available all the individual learning
observations. It often happens that some functions of the learning
observations will suffice for computing the a posteriori probabilities. For
example, the a posteriori probability density for the mean of a Gaussian
distribution given the sample average for the learning observations is the
same as the a posteriori density given all the individual observations.

A function of the learning observations which, in this sense, contains
all the information in the observations relevant to learning $\theta$ is
called a sufficient statistic* for $\theta$.
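The Gaussian example above can be checked numerically. The following is a minimal sketch, assuming a known standard deviation `sigma`, a known number of observations, and a grid of candidate means; the function names and the grid are illustrative choices and do not come from the report. It normalizes the likelihood over the grid once from all the individual observations and once from the sample average alone.

```python
import math
import random


def normalize(weights, step):
    """Scale non-negative weights so that they integrate to one on the grid."""
    total = sum(weights) * step
    return [w / total for w in weights]


def density_from_observations(xs, grid, sigma):
    """Normalized likelihood of the mean, using every individual observation."""
    step = grid[1] - grid[0]
    logs = [sum(-(x - th) ** 2 / (2 * sigma ** 2) for x in xs) for th in grid]
    peak = max(logs)  # subtract the peak to avoid underflow
    return normalize([math.exp(v - peak) for v in logs], step)


def density_from_statistic(sample_mean, n, grid, sigma):
    """Normalized likelihood of the mean, using only the sample average."""
    step = grid[1] - grid[0]
    logs = [-n * (sample_mean - th) ** 2 / (2 * sigma ** 2) for th in grid]
    peak = max(logs)
    return normalize([math.exp(v - peak) for v in logs], step)


random.seed(0)
sigma = 1.0                                        # assumed known
xs = [random.gauss(0.7, sigma) for _ in range(25)]
grid = [-3.0 + 0.01 * i for i in range(601)]       # hypothetical range of means

full = density_from_observations(xs, grid, sigma)
reduced = density_from_statistic(sum(xs) / len(xs), len(xs), grid, sigma)
max_gap = max(abs(a - b) for a, b in zip(full, reduced))
print(max_gap)  # differs from zero only by floating-point rounding
```

The two results agree because the part of the likelihood that depends on the individual observations, but not on the mean, cancels in the normalization.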

In working with sufficient statistics it is considered that they
are written in the form of a vector with real-valued components. That
is, if $T(\Lambda_1, \ldots, \Lambda_n)$ is a sufficient statistic for $\theta$, it is assumed
that

$$T(\Lambda_1, \ldots, \Lambda_n) = \bigl(t_1^{(n)}, t_2^{(n)}, \ldots\bigr) \tag{25}$$

where the $t_i^{(n)}$ are real-valued functions of $\Lambda_1, \ldots, \Lambda_n$. There
follows the obvious

Definition No. 2: The dimension of a sufficient statistic is the
number of components in the vector representation of the statistic.

In the case of learning the unknown mean of a Gaussian density
mentioned above, the sample average is a sufficient statistic of
fixed dimension (d dimensions if a d-variate Gaussian density is
being considered). In some cases, however, the only sufficient
statistic is equivalent to the learning observations themselves** and no
sufficient statistic of fixed dimension exists. The distinction is of
fundamental importance, as indicated by Theorem III below.

* A general treatment of sufficient statistics has been given by
Dynkin [Ref. 19]. Among other things he finds conditions for the
existence of sufficient statistics of the forms needed for this study
and methods for computing such sufficient statistics.

** The statistic is equivalent to the observations if the observations
can be computed from the statistic and vice versa.

It is now possible to state a simple criterion for determining
whether the experimental portion of the a posteriori density is
reproducing or not. Since this density is not defined before observing
$\Lambda_1$, the procedure suggested by Definition No. 1 is slightly altered by
checking whether $p(\theta \mid \Lambda_1)$ reproduces itself with respect to
$p(\Lambda_2 \mid \theta)$ or not.

Theorem III: The probability density $p(\theta \mid \Lambda_1)$ reproduces itself
with respect to the likelihood $p(\Lambda_2 \mid \theta)$ if and only if a sufficient
statistic for $\theta$ of fixed dimension exists.

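Proof: To prove this theorem the factorization theorem for
sufficient statistics is applied [Ref. 20]. The factorization theorem states
that $\bigl(t_1^{(n)}, t_2^{(n)}, \ldots\bigr)$ is a sufficient statistic for $\theta$ if and only
if there exist functions f and h such that

$$p(\Lambda_1, \ldots, \Lambda_n \mid \theta) = f\bigl(t_1^{(n)}, t_2^{(n)}, \ldots, \theta\bigr)\, h(\Lambda_1, \ldots, \Lambda_n) \tag{26}$$

where f depends on $\Lambda_1, \ldots, \Lambda_n$ only through $t_1^{(n)}, t_2^{(n)}, \ldots$ and
where h does not depend on $\theta$.

Assume a sufficient statistic of fixed dimension exists and let
$\bigl(t_1^{(n)}, t_2^{(n)}, \ldots\bigr)$ be such a sufficient statistic. Then, from Eqs. (20)
and (26), the factor h cancels and

$$p(\theta \mid \Lambda_1, \ldots, \Lambda_n) = \frac{f\bigl(t_1^{(n)}, t_2^{(n)}, \ldots, \theta\bigr)}{\int f\bigl(t_1^{(n)}, t_2^{(n)}, \ldots, \theta\bigr)\, d\theta} \tag{27}$$

This is a fixed function of the parameters $t_1^{(n)}, t_2^{(n)}, \ldots$. Hence, the
$p(\theta \mid \Lambda_1, \ldots, \Lambda_n)$'s differ only in the values assigned to these parameters
and each reproduces itself with respect to $p(\Lambda_{n+1} \mid \theta)$.

Conversely, assume $p(\theta \mid \Lambda_1)$ reproduces itself with respect to
$p(\Lambda_2 \mid \theta)$. Then there exist r parameters $\alpha_1^{(n)}, \ldots, \alpha_r^{(n)}$ and a function
g such that

$$p(\theta \mid \Lambda_1, \ldots, \Lambda_n) = g\bigl(\alpha_1^{(n)}, \ldots, \alpha_r^{(n)}, \theta\bigr) \tag{28}$$

since it is known that all of these densities are of the same form,
differing only in the values assigned to parameters. From Eqs. (20) and
(28),

$$p(\Lambda_1, \ldots, \Lambda_n \mid \theta) = g\bigl(\alpha_1^{(n)}, \ldots, \alpha_r^{(n)}, \theta\bigr) \cdot \int p(\Lambda_1, \ldots, \Lambda_n \mid \theta)\, d\theta \tag{29}$$

The last integral is not a function of $\theta$, since this parameter is
integrated out of the equation. Hence, by the factorization theorem for
sufficient statistics, the $\alpha$'s comprise a sufficient statistic for $\theta$
of fixed dimension.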
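As a worked illustration of the factorization used in the proof, consider the Gaussian case mentioned earlier. The setting below is an assumption made only for this illustration: each set of observations is a single real number $x_i$, Gaussian with unknown mean $\theta$ and known variance $\sigma^2$; the symbols $x_i$, $\bar{x}$ and $\sigma$ do not appear in the report.

```latex
% Assumed setting: x_1, ..., x_n independent, Gaussian, mean \theta, known variance \sigma^2.
p(x_1,\ldots,x_n \mid \theta)
  = \underbrace{\exp\!\Bigl[-\frac{n(\bar{x}-\theta)^2}{2\sigma^2}\Bigr]}_{f(\bar{x},\,n,\,\theta)}
    \cdot
    \underbrace{(2\pi\sigma^2)^{-n/2}
      \exp\!\Bigl[-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\bar{x})^2\Bigr]}_{h(x_1,\ldots,x_n)},
\qquad
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i .

% Normalizing over \theta, the factor h cancels:
p(\theta \mid x_1,\ldots,x_n)
  = \frac{f(\bar{x}, n, \theta)}{\int f(\bar{x}, n, \theta)\, d\theta}
  = \sqrt{\frac{n}{2\pi\sigma^2}}\;
    \exp\!\Bigl[-\frac{n(\theta-\bar{x})^2}{2\sigma^2}\Bigr].
```

Under these assumptions the normalized likelihood is always a Gaussian in $\theta$ whose only changing quantities are $\bar{x}$ and $n$, which is the fixed-form behavior the theorem describes.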


D.  REPRODUCING  A  PRIORI  DENSITIES 

By combining the results in Theorems II and III, solutions can be
obtained to the problems of determining when reproducing-type densities
exist and of finding the forms of any that exist.

First, it is noted that the factorization in Eq. (19), Theorem II,
expresses the a posteriori density obtained with the a priori density
$p(\theta)$ as the product of the experimental portion
$p(\theta \mid \Lambda_1, \ldots, \Lambda_n)$ of Eq. (20) and another function of $\theta$. Hence, if the
a priori and a posteriori densities are all to be of the same form, the
experimental portions $p(\theta \mid \Lambda_1)$, $p(\theta \mid \Lambda_1, \Lambda_2)$, ...
must all be of the same form. According to Theorem III, this means that
a sufficient statistic of fixed dimension must exist.

Second, it may be seen that if $p(\theta)$ is to be a reproducing-type
a priori density, it must be of the same form as the a posteriori
density. Hence, $p(\theta)$ must be a function of the form
of $p(\theta \mid \Lambda_1, \ldots, \Lambda_n)$ multiplied by another function of $\theta$. This condition
is stated by postulating that $p(\theta)$ must be of the form

$$p(\theta) = \frac{p(\theta \mid \Lambda_{-m}, \ldots, \Lambda_0)\, r(\theta)}{\int p(\theta \mid \Lambda_{-m}, \ldots, \Lambda_0)\, r(\theta)\, d\theta} \tag{30}$$

where $p(\theta \mid \Lambda_{-m}, \ldots, \Lambda_0)$ is calculated by choosing a sequence of sets
of "a priori observations," denoted by $\Lambda_{-m}, \ldots, \Lambda_0$,* and applying
Eq. (20), and where $r(\theta)$ is a non-negative, integrable, but otherwise
arbitrary function of $\theta$.**

Conversely, if an a priori $p(\theta)$ of the form in Eq. (30) be
assumed, there results for the a posteriori density

$$p(\theta \mid \Lambda_1, \ldots, \Lambda_n) = \frac{p(\Lambda_1, \ldots, \Lambda_n \mid \theta)\, p(\theta)}{\int p(\Lambda_1, \ldots, \Lambda_n \mid \theta)\, p(\theta)\, d\theta} = \frac{p(\theta \mid \Lambda_{-m}, \ldots, \Lambda_0, \Lambda_1, \ldots, \Lambda_n)\, r(\theta)}{\int p(\theta \mid \Lambda_{-m}, \ldots, \Lambda_0, \Lambda_1, \ldots, \Lambda_n)\, r(\theta)\, d\theta} \tag{31}$$

where use has been made of Eqs. (20) and (30) and of the assumption
that the $\Lambda_i$'s are conditionally independent given $\theta$. If a
sufficient statistic for $\theta$ of fixed dimension exists, the same analysis
used in deriving Theorem III shows that both Eqs. (30) and (31) are of
the same form, and hence that $p(\theta)$ is a reproducing-type a priori
density.
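The relation between Eqs. (30) and (31) can be checked numerically. The sketch below assumes independent binary observations with unknown probability of a one, a midpoint grid on (0, 1), and a hypothetical factor `r`; none of these specific choices come from the report. It builds the a priori density by the two normalizations of Eq. (30), applies Bayes' rule with new observations, and compares the result with the same functional form evaluated on the pooled a priori and learning observations.

```python
def integrate(values, step):
    """Approximate an integral over (0, 1) by a midpoint Riemann sum."""
    return sum(values) * step


def likelihood(p, ones, zeros):
    """Likelihood of independent binary observations when P(one) = p."""
    return p ** ones * (1.0 - p) ** zeros


def r(p):
    """Hypothetical non-negative factor r(theta); chosen only for illustration."""
    return 1.0 + p


N = 2000
STEP = 1.0 / N
GRID = [(i + 0.5) * STEP for i in range(N)]


def normalized(values):
    total = integrate(values, STEP)
    return [v / total for v in values]


prior_ones, prior_zeros = 3, 2   # assumed "a priori observations"
new_ones, new_zeros = 4, 6       # assumed learning observations

# Eq. (30): normalize the likelihood, multiply by r, renormalize.
experimental = normalized([likelihood(p, prior_ones, prior_zeros) for p in GRID])
prior = normalized([e * r(p) for e, p in zip(experimental, GRID)])

# Bayes' rule with this a priori density and the new observations.
posterior = normalized(
    [likelihood(p, new_ones, new_zeros) * q for p, q in zip(GRID, prior)]
)

# The same functional form applied directly to the pooled observations.
pooled = normalized(
    [likelihood(p, prior_ones + new_ones, prior_zeros + new_zeros) * r(p)
     for p in GRID]
)

max_gap = max(abs(a - b) for a, b in zip(posterior, pooled))
print(max_gap)  # differs from zero only by floating-point rounding
```

In this sketch only the counts of ones and zeros change between the a priori and a posteriori densities; the procedure that turns the counts into a density stays the same.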


* The "a priori observations" are utilized to represent the available
a priori information. In a typical application the sets $\Lambda_{-m}, \ldots, \Lambda_0$
are sets which are thought a priori to be typical sets of observations,
with the total number of observations in these sets a measure of the
confidence placed in the a priori information (see Section F).

Actually, of course, only the sufficient statistics for the a priori
observations need be chosen; it is even possible to use sufficient
statistics that do not correspond to physically realizable sets of
observations (for example, a component of the sufficient statistics
corresponding to the number of observations might not be an integer)
if the form of the probability density $p(\theta \mid \Lambda_{-m}, \ldots, \Lambda_0)$ is unchanged.
If the observations are not physically realizable, the notation of Eq.
(30) may be slightly misleading; it is kept for the aid in visualizing
methods of generating reproducing densities which it provides.

** Rather than stating that $r(\theta)$ itself is integrable, it would be
more accurate to state that the integral in the denominator of Eq. (30)
exists. It will also be assumed that similar integrals, such as those
in the denominator of Eq. (31), exist.

The following theorem has now been proved:

Theorem IV: Assume that the sets of observations $\Lambda_i$ and $\Lambda_j$,
$i \neq j$, are conditionally independent given $\theta$. Then a reproducing-type
a priori density $p(\theta)$ exists if and only if a sufficient statistic
for $\theta$ of fixed dimension exists. Any reproducing-type density that
exists is of the form given in Eq. (30).

Theorem IV is the fundamental theorem in the analysis of
reproducing-type densities in the case where the conditional independence
assumption is satisfied. It indicates that the learning process can satisfy
the definition of feasibility utilized in this report (see Chapter III,
Section D) if and only if a sufficient statistic of a simple form exists.
It also gives a method for generating any reproducing-type densities that
do exist. All those that exist can be generated by taking a function of
$\theta$ of the form of the likelihood $p(\Lambda_{-m}, \ldots, \Lambda_0 \mid \theta)$ of possible sets of
observations, multiplying by an arbitrary non-negative function of $\theta$,
and then normalizing. In deriving Eq. (30), this normalization was done
in two steps, first normalizing $p(\Lambda_{-m}, \ldots, \Lambda_0 \mid \theta)$ to obtain
$p(\theta \mid \Lambda_{-m}, \ldots, \Lambda_0)$, then multiplying by $r(\theta)$ and renormalizing. A
one-step normalization will suffice, as putting the definition of
$p(\theta \mid \Lambda_{-m}, \ldots, \Lambda_0)$ [Eq. (20)] into Eq. (30) gives

$$p(\theta) = \frac{p(\Lambda_{-m}, \ldots, \Lambda_0 \mid \theta)\, r(\theta)}{\int p(\Lambda_{-m}, \ldots, \Lambda_0 \mid \theta)\, r(\theta)\, d\theta} \tag{30a}$$

Similarly, Eq. (31) may be rewritten as

$$p(\theta \mid \Lambda_1, \ldots, \Lambda_n) = \frac{p(\Lambda_{-m}, \ldots, \Lambda_0, \Lambda_1, \ldots, \Lambda_n \mid \theta)\, r(\theta)}{\int p(\Lambda_{-m}, \ldots, \Lambda_0, \Lambda_1, \ldots, \Lambda_n \mid \theta)\, r(\theta)\, d\theta} \tag{31a}$$

The existence of a sufficient statistic of fixed dimension is more
important than the use of a reproducing-type a priori density as a
criterion for determining the feasibility of the learning process. In
fact, the same arguments used to establish Theorem IV can be used to
establish the following:


Theorem V: Assume that the sets of learning observations $\Lambda_i$ and
$\Lambda_j$, $i \neq j$, are conditionally independent given $\theta$. Then, regardless
of the a priori density $p(\theta)$, the a posteriori density $p(\theta \mid \Lambda_1)$
reproduces itself with respect to the likelihood $p(\Lambda_2 \mid \theta)$ if and only
if a sufficient statistic for $\theta$ of fixed dimension exists.

Thus, if there is no objection to one reprogramming of the learning
system after the first set of learning observations, it is merely
necessary that there exist a sufficient statistic of fixed dimension.
The form of the learning system will remain fixed after this one change,
regardless of what a priori $p(\theta)$ is used. It may not always be
obvious that the form is constant, but it will be possible algebraically
to manipulate the densities into the form in Eq. (19). Since the
experimental portion $p(\theta \mid \Lambda_1, \ldots, \Lambda_n)$ remains of constant form, the whole
density in Eq. (19) remains of constant form.

Another result similar to that in Theorem V should be pointed out.
Regardless of what a priori density $p(\theta)$ is used, it is always
possible to write the density in the form of Eq. (30a), i.e., as a
reproducing density. To do this, it is merely necessary to pick an
arbitrary sequence $\Lambda_{-m}, \ldots, \Lambda_0$ of sets of "a priori observations"
and multiply both numerator and denominator of the a priori density
by $p(\Lambda_{-m}, \ldots, \Lambda_0 \mid \theta)$. Rewriting the density in this manner appears
physically meaningless, however. Also, in view of Theorem V, little
appears to be gained by such an approach. Although this possibility
should be noted, it will normally be neglected in this report. Unless
otherwise stated, it is assumed that the denominator of $r(\theta)$ contains
no terms of the form of the likelihood function.


E.  CONVERGENCE  RATES  WITH  VARIOUS  A  PRIORI  DENSITIES 

In view of Theorem V, it appears that the use of nonreproducing
a priori distributions will often give little if any increase in the
complexity of the learning system. If the rate at which the a
posteriori density approached a delta function were greater with a
nonreproducing a priori density, the latter type of a priori density
might be preferred despite some small increase in complexity. It is
easy to prove that no appreciable increase in rate of convergence can be
obtained by choosing a different a priori probability density, however;
the proof follows.

Consider two a priori densities $p_0(\theta)$ and $p_1(\theta)$ and the
corresponding a posteriori densities $p_0(\theta \mid \Lambda_1, \ldots, \Lambda_n)$ and
$p_1(\theta \mid \Lambda_1, \ldots, \Lambda_n)$. If $p_0(\theta)$ and $p_1(\theta)$ are approximately the same
width, then $p_0(\theta \mid \Lambda_1, \ldots, \Lambda_n)$ and $p_1(\theta \mid \Lambda_1, \ldots, \Lambda_n)$ are approximately
the same width. To show this, it is assumed that $p_0(\theta)$ and $p_1(\theta)$
both have the same mode* $\theta_0$ and that for some other point $\theta_1$,

$$\frac{p_0(\theta_0)}{p_0(\theta_1)} = \frac{p_1(\theta_0)}{p_1(\theta_1)} \tag{32}$$

(where $\theta_1$ might be a common 3-dB point for the a priori densities).
Then, from Eqs. (19) and (32),

$$\frac{p_0(\theta_0 \mid \Lambda_1, \ldots, \Lambda_n)}{p_0(\theta_1 \mid \Lambda_1, \ldots, \Lambda_n)} = \frac{p(\theta_0 \mid \Lambda_1, \ldots, \Lambda_n)}{p(\theta_1 \mid \Lambda_1, \ldots, \Lambda_n)} \cdot \frac{p_0(\theta_0)}{p_0(\theta_1)} = \frac{p(\theta_0 \mid \Lambda_1, \ldots, \Lambda_n)}{p(\theta_1 \mid \Lambda_1, \ldots, \Lambda_n)} \cdot \frac{p_1(\theta_0)}{p_1(\theta_1)} = \frac{p_1(\theta_0 \mid \Lambda_1, \ldots, \Lambda_n)}{p_1(\theta_1 \mid \Lambda_1, \ldots, \Lambda_n)} \tag{33}$$

where the unsubscripted $p(\theta \mid \Lambda_1, \ldots, \Lambda_n)$ is the experimental portion
of Eq. (20). Hence, the two a posteriori densities narrow down equally
fast as more observations are taken.
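The ratio argument above can be illustrated numerically. The following sketch assumes binary observations and two hypothetical a priori shapes on (0, 1), a parabola and a tent, chosen here only because both have their mode at 0.5 and the same ratio between 0.5 and 0.2; neither shape comes from the report. The a priori shapes are left unnormalized, which does not affect ratios of density values.

```python
def likelihood(theta, ones, zeros):
    """Likelihood of independent binary observations when P(one) = theta."""
    return theta ** ones * (1.0 - theta) ** zeros


def prior0(theta):
    """Hypothetical a priori shape: parabola with mode at 0.5."""
    return theta * (1.0 - theta)


def prior1(theta):
    """Hypothetical a priori shape: tent with mode at 0.5."""
    return 1.0 - 1.2 * abs(theta - 0.5)


def posterior(prior, ones, zeros, cells=4000):
    """Return the a posteriori density as a function, normalized on (0, 1)."""
    step = 1.0 / cells
    z = sum(
        likelihood((i + 0.5) * step, ones, zeros) * prior((i + 0.5) * step)
        for i in range(cells)
    ) * step
    return lambda theta: likelihood(theta, ones, zeros) * prior(theta) / z


theta0, theta1 = 0.5, 0.2   # the common mode and a second reference point
ratios = []
for ones, zeros in [(1, 1), (5, 5), (50, 50)]:
    post0 = posterior(prior0, ones, zeros)
    post1 = posterior(prior1, ones, zeros)
    ratios.append((post0(theta0) / post0(theta1),
                   post1(theta0) / post1(theta1)))

for ratio0, ratio1 in ratios:
    print(ratio0, ratio1)  # the two columns agree up to rounding
```

The ratio grows with the number of observations, and it grows identically for both a priori shapes, which is the sense in which neither choice narrows faster.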


F.  GENERALIZATION  OF  THE  THEORY  TO  INCLUDE  DEPENDENT  LEARNING  OBSERVATIONS 

The results may now be generalized to apply to the case where the
learning observations are not necessarily independent given $\theta$. The
procedure will be first to give a simple example of finding a reproducing
density without the assumption of conditional independence, then to use
this example to deduce the changes necessary in the theory in order to
cover the general case.

* The mode of the density is the value of $\theta$ for which the density
takes its maximum value.

A binary Markov process is a simple example of a case where the
observations are not conditionally independent given the parameters
characterizing the process. If it is assumed there are two possible
states, 1 and 0, and if $P_{ij}$ is assumed to be the probability of a
transition from state i to state j, the probability of observing a
1 or 0 at a given time is not a function of the $P_{ij}$'s alone. It
also depends on the previous digits observed. Hence, the theory thus
far developed is not directly applicable.

Reproducing densities for the $P_{ij}$'s can easily be found, however.*
If each $\Lambda_i$ consists of a single observation and the sequence
$(\Lambda_1, \ldots, \Lambda_n)$ contains a total of $n_1$ ones, of which $r_{11}$ are
followed by ones, and $n_0$ zeros, $r_{00}$ followed by zeros, there
results

$$P(\Lambda_1, \ldots, \Lambda_n \mid P_{00}, P_{11}) = P(\Lambda_1)\, P_{00}^{r_{00}} (1-P_{00})^{n_0-r_{00}}\, P_{11}^{r_{11}} (1-P_{11})^{n_1-r_{11}} \tag{34}$$

where use has been made of the fact that $P_{i0} + P_{i1} = 1$.

A reproducing-type density can be found for this case in the same
manner as before, picking the "a priori observations" $(\Lambda_{-m}, \ldots, \Lambda_0)$
consisting of $n_1'$ ones, $r_{11}'$ followed by ones, and $n_0'$ zeros,
$r_{00}'$ followed by zeros, and setting

$$p_{\Lambda_0}(P_{00}, P_{11}) = \frac{P(\Lambda_{-m})\, P_{00}^{r_{00}'} (1-P_{00})^{n_0'-r_{00}'}\, P_{11}^{r_{11}'} (1-P_{11})^{n_1'-r_{11}'}}{\int_0^1 \int_0^1 P(\Lambda_{-m})\, P_{00}^{r_{00}'} (1-P_{00})^{n_0'-r_{00}'}\, P_{11}^{r_{11}'} (1-P_{11})^{n_1'-r_{11}'}\, dP_{00}\, dP_{11}} \tag{35}$$

* In this case the learning observations are discrete random
variables, while the theory has been developed assuming the observations
were continuous random variables. There is no difficulty in extending
the theory to allow observations which are discrete random variables,
however. The only change necessary is replacing probability-density
functions by probability-mass functions in the equations; this may be
verified by replacing Eq. (2) by the form of Bayes' rule applicable
here, and developing the theory in an identical manner.

The parameter $\Lambda_0$ has been included as an index for the density in
Eq. (35) since the computation to be performed after observing $\Lambda_1$
depends on $\Lambda_0$. For example, if $\Lambda_1$ is a one and $\Lambda_0$ a zero, then

$$p_{\Lambda_0}(P_{00}, P_{11} \mid \Lambda_1) = \frac{P(\Lambda_{-m})\, P_{00}^{r_{00}'} (1-P_{00})^{n_0'-r_{00}'+1}\, P_{11}^{r_{11}'} (1-P_{11})^{n_1'-r_{11}'}}{\int_0^1 \int_0^1 P(\Lambda_{-m})\, P_{00}^{r_{00}'} (1-P_{00})^{n_0'-r_{00}'+1}\, P_{11}^{r_{11}'} (1-P_{11})^{n_1'-r_{11}'}\, dP_{00}\, dP_{11}} \tag{36}$$

while if $\Lambda_1$ is a one and $\Lambda_0$ also a one

$$p_{\Lambda_0}(P_{00}, P_{11} \mid \Lambda_1) = \frac{P(\Lambda_{-m})\, P_{00}^{r_{00}'} (1-P_{00})^{n_0'-r_{00}'}\, P_{11}^{r_{11}'+1} (1-P_{11})^{n_1'-r_{11}'}}{\int_0^1 \int_0^1 P(\Lambda_{-m})\, P_{00}^{r_{00}'} (1-P_{00})^{n_0'-r_{00}'}\, P_{11}^{r_{11}'+1} (1-P_{11})^{n_1'-r_{11}'}\, dP_{00}\, dP_{11}} \tag{37}$$

The two expressions, Eqs. (36) and (37), differ in the exponent which
is increased to allow for the additional observation. The computations
after observing $\Lambda_2$ will differ similarly according to whether $\Lambda_1$ is
a one or a zero. However, the densities will always be of the form in
Eq. (35), so the density reproduces.
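The bookkeeping behind Eqs. (35) to (37) can be sketched in a few lines. This is an illustrative sketch only: the function names are hypothetical, and the counts follow the convention, assumed here, that a digit is counted in $n_0$ or $n_1$ once its successor has been observed. The update needs the previous digit in addition to the new one, which is why the density carries an index.

```python
def transition_counts(bits):
    """Return (n0, r00, n1, r11) for a sequence of binary digits.

    n0 and n1 count zeros and ones that have an observed successor;
    r00 and r11 count those followed by the same digit.
    """
    n0 = r00 = n1 = r11 = 0
    for prev, nxt in zip(bits, bits[1:]):
        if prev == 0:
            n0 += 1
            r00 += int(nxt == 0)
        else:
            n1 += 1
            r11 += int(nxt == 1)
    return n0, r00, n1, r11


def update(stats, last_digit, new_digit):
    """Update the four counts after one new digit; which exponent changes
    depends on the previously observed digit."""
    n0, r00, n1, r11 = stats
    if last_digit == 0:
        n0 += 1
        r00 += int(new_digit == 0)
    else:
        n1 += 1
        r11 += int(new_digit == 1)
    return n0, r00, n1, r11


def exponents(stats):
    """Exponents of P00, (1 - P00), P11, (1 - P11) in the form of Eq. (35)."""
    n0, r00, n1, r11 = stats
    return r00, n0 - r00, r11, n1 - r11


bits = [0, 1, 1, 0, 0, 1]
stats = (0, 0, 0, 0)
for prev, nxt in zip(bits, bits[1:]):
    stats = update(stats, prev, nxt)
print(stats, exponents(stats))
```

A zero followed by a one raises only the exponent of $(1 - P_{00})$, as in Eq. (36), while a one followed by a one raises only the exponent of $P_{11}$, as in Eq. (37).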

In the case of more general types of dependence, a similar procedure
can be used, although the computation to be performed may depend on more
than the immediately preceding digit. Such a situation is treated by
introducing a parameter $\alpha_i$, which indicates the state of the system
after the i-th observation. In the most general case $\alpha_i$ may reflect
the complete past history of the system. Using this parameter to index
the densities,

$$p_{\alpha_0}(\theta \mid \Lambda_1, \ldots, \Lambda_n) = \frac{p_{\alpha_0}(\Lambda_1, \ldots, \Lambda_n \mid \theta)\, p_{\alpha_0}(\theta)}{\int p_{\alpha_0}(\Lambda_1, \ldots, \Lambda_n \mid \theta)\, p_{\alpha_0}(\theta)\, d\theta} \tag{38}$$

If the original density is of the form

$$p_{\alpha_0}(\theta) = \frac{p_{\alpha_{-m}}(\Lambda_{-m}, \ldots, \Lambda_0 \mid \theta)\, r(\theta)}{\int p_{\alpha_{-m}}(\Lambda_{-m}, \ldots, \Lambda_0 \mid \theta)\, r(\theta)\, d\theta} \tag{39}$$

it is found that

$$p_{\alpha_0}(\theta \mid \Lambda_1, \ldots, \Lambda_n) = \frac{p_{\alpha_0}(\Lambda_1, \ldots, \Lambda_n \mid \theta)\, p_{\alpha_{-m}}(\Lambda_{-m}, \ldots, \Lambda_0 \mid \theta)\, r(\theta)}{\int p_{\alpha_0}(\Lambda_1, \ldots, \Lambda_n \mid \theta)\, p_{\alpha_{-m}}(\Lambda_{-m}, \ldots, \Lambda_0 \mid \theta)\, r(\theta)\, d\theta} \tag{40}$$

But since $\alpha_0$ reflects the entire past history of the system, it is
possible to write

$$p_{\alpha_0}(\Lambda_1, \ldots, \Lambda_n \mid \theta) = p_{\alpha_{-m}}(\Lambda_1, \ldots, \Lambda_n \mid \theta, \Lambda_{-m}, \ldots, \Lambda_0) \tag{41}$$

By putting this expression in Eq. (40), there results

$$p_{\alpha_0}(\theta \mid \Lambda_1, \ldots, \Lambda_n) = \frac{p_{\alpha_{-m}}(\Lambda_{-m}, \ldots, \Lambda_0, \Lambda_1, \ldots, \Lambda_n \mid \theta)\, r(\theta)}{\int p_{\alpha_{-m}}(\Lambda_{-m}, \ldots, \Lambda_0, \Lambda_1, \ldots, \Lambda_n \mid \theta)\, r(\theta)\, d\theta} \tag{42}$$

The same type of analysis as used in the case where the observations
were independent given $\theta$ shows that Eqs. (39) and (42) are of the same
form if and only if a sufficient statistic of fixed dimension exists.
The proof has now been completed for:

Theorem VI: A reproducing a priori density $p_{\alpha_0}(\theta)$ exists if
and only if a sufficient statistic for $\theta$ of fixed dimension exists.
Any reproducing density that exists is of the form shown by Eq. (39).

Even though the densities reproduce, the process may not be feasible
if the $\alpha$'s can take on very many different values. There appears to
be nothing in the theory that requires the number of different possible
$\alpha_i$'s to be finite, or even countable, in order to have the densities
reproduce. Such questions are largely academic, however, as different
values of the $\alpha$'s normally mean different computations to determine
the new density on $\theta$ (as in the binary Markov example) with
corresponding changes in the form of the learning system.

It is possible to make a statement similar to Theorem V in this
case also. The a posteriori densities eventually become reproducing
if and only if a sufficient statistic of fixed dimension exists, no
matter what a priori density is used. The densities may not begin
reproducing before the system goes through all its possible states, or
distinct $\alpha_i$'s, however.

G.  DISCUSSION  OF  RESULTS 

Solutions are now available for the second and third problems posed
at the end of Chapter III: finding conditions that insure that
reproducing-type densities exist, and finding methods for generating any
reproducing-type densities that do exist. It has been shown that the
existence of a sufficient statistic of fixed dimension guarantees the
existence of reproducing densities, and that any reproducing densities
that exist can be generated by normalizing a non-negative function of
$\theta$ having a factor of the form of the likelihood of a possible set of
observations. The existence of a suitable sufficient statistic is more
important than the use of reproducing distributions in insuring the
feasibility of the learning process, as the sequence of a posteriori
distributions eventually becomes reproducing if such a statistic exists,
regardless of the a priori distribution. No appreciable increase
in rate of convergence of the a posteriori densities to a delta
function can be obtained by the use of a non-reproducing a priori density,
however.

The results apply either with the learning observations conditionally
independent given $\theta$, or without this independence. Without the
independence assumption, however, the form of the learning system may depend
on the state of the system determined by previous observations. If
many such states are possible, the learning procedure may be impractical
even when reproducing-type distributions are used.

The class of reproducing-type a priori densities of the form in
Eq. (30) or (39) is large enough to give considerable freedom in choosing
a priori densities. The a priori observations (or the sufficient
statistics describing these observations) can be chosen almost arbitrarily.
As the examples in the next chapter show, this allows considerable
freedom in choosing the "experimental portion" of the a priori density.

The function $r(\theta)$ can also be used to incorporate a wide variety
of forms of a priori information. Although any non-negative function
of $\theta$ can be used for $r(\theta)$ (assuming the integrability requirements
over $\theta$ are met), most of these forms are physically meaningless. In
the next chapter are given a few examples of forms that $r(\theta)$ may take.

One of the more interesting forms for $r(\theta)$ is a constant. When $r(\theta)$
is constant, the a priori density in Eq. (30) or (39) is identical to
the a posteriori density that would have been obtained after actually
observing the "a priori observations," if a uniform a priori density
had been assumed.*

* This argument breaks down if $\theta$ is defined over a set of infinite
Lebesgue measure, since uniform densities over sets of infinite measure
have no meaning in the conventional theory of probability. Such
densities do have meaning in the theory developed by Renyi [Ref. 21],
however.

The a priori knowledge reflected by densities of the forms in
Eqs. (30) or (39) may be considered to be of two forms: one form
equivalent to knowledge that could have been obtained from observations and
the other form representing knowledge that could not have been obtained
in this manner. Thus, all the knowledge about $\theta$ incorporated in the
"experimental portion" of the a priori density, $p(\theta \mid \Lambda_{-m}, \ldots, \Lambda_0)$,
could have been obtained from actual observations; this is not necessarily
true of the knowledge incorporated in $r(\theta)$, however.

A simple measure of confidence in the a priori knowledge contained
in $p(\theta \mid \Lambda_{-m}, \ldots, \Lambda_0)$ is available. The confidence placed in the portion
of the a priori knowledge reflected in the "experimental portion" of
the a priori density is considered proportional to the size of the set
of observations necessary to generate this portion of the density. In
each case that has been examined (see Chapter VI), this experimental
portion of the density approaches a uniform density as the size of the
set of observations approaches zero, and approaches a delta function
as the size increases without limit. These are the limits that would
be expected as the amount of a priori knowledge approached zero or
approached complete knowledge of $\theta$, respectively.

H. USE OF BAYES' RULE COMPUTER

By applying the factorization theorem for sufficient statistics,
Eq. (31) can be rewritten as

$$p(\theta \mid \Lambda_1, \ldots, \Lambda_n) = \frac{f\bigl(t_1^{(-m,n)}, t_2^{(-m,n)}, \ldots, \theta\bigr)\, r(\theta)}{\int f\bigl(t_1^{(-m,n)}, t_2^{(-m,n)}, \ldots, \theta\bigr)\, r(\theta)\, d\theta} \tag{43}$$

where the $t_i^{(-m,n)}$ are the components of the sufficient statistic for
the combined a priori and a posteriori observations. Since $r(\theta)$
is a fixed function of $\theta$, the density in Eq. (43) is a fixed function
of $\theta$ and the parameters $t_1^{(-m,n)}, t_2^{(-m,n)}, \ldots$. Combining this with
the previous results gives the schematic diagram drawn in Fig. 4 for
the Bayes' rule computer in Fig. 1. If reproducing a priori densities
are not used, the form of the computer may change initially, but will
eventually become that in Fig. 4.

By incorporating the form of the Bayes' rule computer shown in Fig.
4 in the model of Fig. 1, a more detailed model for the learning process
is obtained with conditionally independent observations. The chief
difference in the model if it were designed for the case without
conditional independence would be that the form of the Bayes' rule
computer might depend on the value of $\alpha$. If it be assumed that $\alpha$
may take on r possible values, the learning process can be illustrated
by the model shown in Fig. 5. The computer selector in this model
computes the value of $\alpha$ and feeds the learning observations into the
appropriate Bayes' rule computer. If the learning observations are
conditionally independent given $\theta$, the model in Fig. 5 reduces to that
in Fig. 1, since in this case $\alpha$ may be considered to be constant.

FIG. 4. BAYES' RULE COMPUTER WITH REPRODUCING A PRIORI DENSITY.

Rather than using different Bayes' rule computers for different
states of the learning system, it may well be more practical to use one
computer with a variable program. If this approach is used, the computer
selector in Fig. 5 may be considered to be a computation program selector.
The same model applies with some minor relabeling.

In all the theory that has been developed, it has been assumed that
the equations deal with probability densities only, for the sake of
convenience. Any of the densities can be replaced by probability mass
functions if discrete rather than continuous random variables are
encountered.* Some of the alternate equations have actually been
utilized in the example introducing the methods of generalizing to the
case where the learning observations are dependent.

FIG. 5. GENERAL MODEL FOR LEARNING PROCESS.

The next chapter is devoted to examples of reproducing-type
distributions. These examples should clarify some of the theory developed in
the investigation.

* The term "reproducing-type distributions" is used in the title of
this report as being more general than "reproducing-type densities."
Probability mass functions may reproduce also.


VI. EXAMPLES OF REPRODUCING-TYPE DISTRIBUTIONS


In this chapter are given a number of examples of probability
distributions that are reproducing. The two criteria that have been
utilized in choosing the examples are the engineering utility of the
probability distributions involved and the possibility of illustrating
different properties of the distributions.

Two different classes of reproducing distributions are considered.
For the first class, called simple reproducing distributions, $r(\theta)$
is a constant and hence $p(\theta)$ equals $p(\theta \mid \Lambda_{-m}, \ldots, \Lambda_0)$. For the second
class, called composite reproducing distributions, $r(\theta)$ is not
constant. Hence, a composite reproducing distribution is the product
of a simple reproducing distribution and another function of $\theta$.

A.  A  SAMPLE  COMPUTATION:  THE  BINOMIAL  DISTRIBUTION 

The  binomial  distribution  is  probably  the  most  common  discrete 
probability  distribution  in  engineering  applications.  It  might  be 
termed, in everyday engineering language, the "go/no-go" distribution.
This  distribution  can  describe  the  probability  that  a  switch  is  open 
or  closed;  or  the  probability  that  a  signal  corresponds  to  a  one  or  to 
a  zero;  or  a  myriad  of  other  cases  where  only  two  events  are  considered 
to  be  possible.  If  the  probability  P  characterizing  this  distribution 
is  unknown,  the  learning  procedure  developed  in  this  paper  is  applicable. 

It  is  assumed  for  the  sake  of  definiteness  that  the  two  possible 
events  are  the  reception  of  a  one  and  of  a  zero.  If  P  were  known,  it 
would  be  the  probability  of  a  one.  Each  is  assumed  to  be  the 

observation  of  a  single  digit. 

To  find  a  simple  reproducing  density,  a  specific  a  priori  sequence 

A  ...A*  consisting  of  r  ones  and  n  -  r  zeros  is  assumed. 

-n  +1 7  0  &  o  oo 

o 

Making  use  of  Theorem  II,  Chapter  V,  and  the  basic  definition  given  by 

Eq.  (20),  but  replacing  the  symbols  p(A. ,  ...  A,  \0)  for  the  likeliho  d 

i  J 

functions  by  the  discrete  random  variable  analogs  P(a  ,  ...  A  |@) 

i  j 

(since  the  binomial  distribution  is  a  discrete  rather  than  a  continuous 
distribution),  p(P)  is  chosen  to  be 
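$$
\begin{aligned}
p(P) &= p(P \mid \lambda_{-n_0+1}, \ldots, \lambda_0)
      = \frac{P(\lambda_{-n_0+1}, \ldots, \lambda_0 \mid P)}
             {\int_0^1 P(\lambda_{-n_0+1}, \ldots, \lambda_0 \mid P)\, dP} \\
     &= \frac{P^{r_0}(1-P)^{n_0-r_0}}
             {\int_0^1 P^{r_0}(1-P)^{n_0-r_0}\, dP}
        \qquad (0 < P < 1) \\
     &= \begin{cases}
          \dfrac{\Gamma(n_0+2)}{\Gamma(r_0+1)\,\Gamma(n_0-r_0+1)}\,
          P^{r_0}(1-P)^{n_0-r_0}, & 0 < P < 1, \\
          0, & \text{otherwise.}
        \end{cases}
\end{aligned}
\tag{44}
$$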


The  density  given  by  Eq.  (44)  may  be  recognized  as  a  beta  density 
function.  This  fact  can  be  used  to  check  the  normalizing  constant 
obtained.  Alternatively,  the  normalizing  constant  can  be  obtained  by 
finding  a  standard  probability  density  function  that  depends  on  its 
argument in the same way that the function in Eq. (44) depends on P,
relying on the fact that standard density functions are normalized to
integrate  to  one.  In  any  event,  determining  whether  the  density  is  a 
standard  form  is  useful,  since,  if  such  is  the  case,  the  important 
properties  of  the  density  may  have  been  tabulated. 

In the equations for the a posteriori density when a reproducing
a priori density is used, there is no distinction between effects of
a priori and a posteriori observations. Hence, the a posteriori
density after observing a sequence consisting of $r_1$ ones and
$n_1 - r_1$ zeros is


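$$
\begin{aligned}
p(P \mid \lambda_1, \ldots, \lambda_{n_1})
 &= p(P \mid \lambda_{-n_0+1}, \ldots, \lambda_0, \lambda_1, \ldots, \lambda_{n_1}) \\
 &= \begin{cases}
      \dfrac{\Gamma(n_0+n_1+2)}{\Gamma(r_0+r_1+1)\,\Gamma(n_0+n_1-r_0-r_1+1)}\,
      P^{r_0+r_1}(1-P)^{n_0+n_1-r_0-r_1}, & 0 < P < 1, \\
      0, & \text{otherwise}
    \end{cases} \\
 &= \begin{cases}
      \dfrac{\Gamma(n+2)}{\Gamma(r+1)\,\Gamma(n-r+1)}\,
      P^{r}(1-P)^{n-r}, & 0 < P < 1, \\
      0, & \text{otherwise,}
    \end{cases}
\end{aligned}
\tag{45}
$$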


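where

$$n = n_0 + n_1, \tag{46a}$$

$$r = r_0 + r_1. \tag{46b}$$

The mean and variance of Eq. (45) are given by

$$E[P \mid \lambda_1, \ldots, \lambda_{n_1}] = \frac{r+1}{n+2} \tag{47a}$$

$$\mathrm{Var}[P \mid \lambda_1, \ldots, \lambda_{n_1}] = \frac{(r+1)(n-r+1)}{(n+2)^2 (n+3)} \tag{47b}$$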

As  the  total  number  of  a  priori  and  a  posteriori  observations 
approaches  zero,  the  above  values  approach 


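$$E[P \mid \lambda_1, \ldots, \lambda_{n_1}] \to \frac{1}{2} \tag{48a}$$

$$\mathrm{Var}[P \mid \lambda_1, \ldots, \lambda_{n_1}] \to \frac{1}{12} \tag{48b}$$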

These  are  the  values  of  the  mean  and  variance  of  a  uniform  density  over 
the  interval  from  zero  to  one.  Conversely,  as  the  total  number  of 
a  priori  and  a  posteriori  observations  becomes  very  large, 

$$E[P \mid \lambda_1, \ldots, \lambda_{n_1}] \to \lim_{r,n \to \infty} \frac{r}{n} \triangleq P_0 \tag{49a}$$

$$\mathrm{Var}[P \mid \lambda_1, \ldots, \lambda_{n_1}] \to 0 \tag{49b}$$

These are the values of the mean and variance of a delta function density
at P equals $P_0$. Moreover, for any finite number of a priori
observations, the limiting ratio in Eq. (49a) will be the limiting ratio of the
values for the a posteriori observations, which, according to the
strong  law  of  large  numbers,  is,  with  probability  one,  the  true  value 
of  P. 

In  this  case  it  is  easy  to  show  that  the  limiting  forms  of  the 
density, for small and large numbers of observations, are a uniform
density  over  the  interval  from  zero  to  one  and  a  delta  function  at  the 
true  value  of  P.  The  results  are  left  in  the  form  of  Eqs.  (48)  and  (49) 
for  easy  comparison  with  other  reproducing-type  densities  obtained, 
however. 

Sufficient statistics for the sequences of observations arise
naturally from the analysis. The pairs of numbers $(n_0, r_0)$,
$(n_1, r_1)$ and $(n, r)$ are sufficient for the a priori, a posteriori
and total sequences respectively.
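
The learning procedure of this section can be sketched in a few lines of Python. The sketch below is illustrative only: the function names and the sample digits are assumptions introduced here, not part of the original analysis. It carries the pair (n, r) as the sufficient statistic, evaluates the beta density of Eq. (45), and computes the mean and variance of that density.

```python
from math import exp, lgamma, log


def update(stat, digits):
    """Fold a block of observed binary digits into the statistic (n, r)."""
    n, r = stat
    return n + len(digits), r + sum(digits)


def beta_density(p, stat):
    """Density proportional to p**r * (1 - p)**(n - r) on 0 < p < 1."""
    n, r = stat
    if not 0.0 < p < 1.0:
        return 0.0
    # Normalizing constant Gamma(n+2) / (Gamma(r+1) Gamma(n-r+1)), in logs.
    log_norm = lgamma(n + 2) - lgamma(r + 1) - lgamma(n - r + 1)
    return exp(log_norm + r * log(p) + (n - r) * log(1.0 - p))


def mean_and_variance(stat):
    """Mean and variance of the beta density defined by (n, r)."""
    n, r = stat
    mean = (r + 1) / (n + 2)
    variance = (r + 1) * (n - r + 1) / ((n + 2) ** 2 * (n + 3))
    return mean, variance


prior = (4, 2)                                  # hypothetical (n0, r0)
posterior = update(prior, [1, 1, 0, 1, 1, 1])   # adds n1 = 6, r1 = 5
print(posterior, mean_and_variance(posterior))
```

Because only the totals enter the density, processing the a posteriori digits in one block or in several smaller blocks gives the same result.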


B.  SOME  SIMPLE  REPRODUCING-TYPE  DISTRIBUTIONS 

In  this  section,  ten  typical  examples  of  simple  reproducing-type 
distributions  are  analyzed.  The  distributions  treated,  the  unknown 
parameters, and the form of the learning observations are listed in
Table  1.  Table  2  gives  the  likelihood  of  the  learning  observations 
and  the  simple  reproducing-type  densities. 

1.  Probability  Distributions  Considered 

Four discrete distributions are treated: the binomial, the
multinomial, the binary Markov, and the Poisson. In each case parameters
characterizing the probability mass function are unknown. Six examples
of continuous distributions with some of the parameters characterizing
the probability density functions unknown are also treated. These include
three examples of Gaussian densities, one multidimensional with unknown
mean vector, one multidimensional with unknown covariance matrix, and
one one-dimensional with a complex mean and both magnitude and phase of
the mean unknown. The three other cases are the Rayleigh, the
exponential, and the zero-mean rectangular distributions with parameters
characterizing these distributions unknown.

* In the appendix the case of a multidimensional Gaussian density with
both means and covariances unknown is also treated. The simple reproducing
density in this case is the composite Wishart-Gaussian density used by
Keehn (see Chapter III, Section C).

TABLE 2. SIMPLE REPRODUCING DENSITIES

Each  of  the  ten  distributions  considered  has  important  engineering 
applications. The binomial, multinomial, and binary Markov distributions
are important in such fields as coding, hypothesis testing, and
pattern  recognition.  Typical  applications  of  the  Poisson  distribution 
are  in  the  study  of  shot  noise  and  various  waiting  time  and  counting 
problems.  The  Gaussian  densities  occur  so  often  that  little  comment  is 
necessary,  save  for  the  fact  that  the  form  with  a  complex  mean  is  the 
form  that  would  be  used  when  using  complex  numbers  to  indicate  both 
magnitude and phase information in a single number. The Rayleigh density
is the probability density for the envelope of a narrow-band Gaussian
random  process  and  (among  other  applications)  is  used  in  the  study  of 
the  fading  of  radio  signals.  The  exponential  density  is  the  density  for 
the output of a square-law detector (square-law device followed by a
low-pass  filter),  with  a  narrow-band  Gaussian  input.  The  final  case, 
the  rectangular  density,  is  useful  in  such  areas  as  the  study  of  systems 
with  unknown  phases  or  an  unknown  time  reference,  or  studies  involving 
the  location  of  an  object  confined  to  a  specific  interval. 

2.  Computation  Methods 

In  computing  the  reproducing  densities  for  Table  2,  subscripts 
to  indicate  that  the  observations  are  "a  priori  observations"  have  been 
omitted. The densities may be considered as either a priori or
a posteriori forms, since a priori and a posteriori observations are
equivalent  in  their  effects  on  the  densities. 

Each  of  the  densities  in  Table  2  was  obtained  in  a  manner  analogous 
to  the  computation  for  the  binomial  distribution  given  in  the  previous 
section. In two cases--the Gaussian with unknown covariances (Case 6)
and the Rayleigh (Case 8)--it was found convenient to define as a new
parameter the inverse of the unknown, and then to find a reproducing
density for this inverse parameter. This was done purely for the sake
of convenience; by writing the densities in terms of the inverse
parameters p and $K^{-1}$, standard forms are obtained with the
normalization constants and important properties tabulated. In each of the
eight cases where standard probability densities were obtained as the
reproducing densities, the common name for the density obtained is
indicated in Table 2.

3.  Analysis  of  Reproducing  Densities 

The first case on the list (the binomial distribution) has
already been discussed in some detail. The second case, multinomial
distribution with $P_i$'s unknown, and the third case, binary Markov
with transition probabilities unknown, are generalizations of the
binomial case. It is found that the reproducing density for the
multinomial distribution (which is equivalent to the (m-1)-dimensional
generalization of the binomial distribution) is the (m-1)-dimensional
generalization of the beta density, i.e., it is the Dirichlet density.
Similarly, in the binary Markov case, by assuming that the first digit
of the a priori sequence for learning the two unknown transition
probabilities is chosen independently of those probabilities, any
interaction between these two probabilities is removed, so that they
can be treated as independent random
variables, each distributed according to a beta density.
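
The multinomial generalization can be illustrated with a short Python sketch. It assumes, by analogy with the binomial case, that the reproducing density generated by category counts $r_1, \ldots, r_m$ is the Dirichlet density with exponents $r_i$ (i.e., Dirichlet parameters $r_i + 1$); the function names and the sample outcomes are hypothetical.

```python
def update_counts(counts, outcomes):
    """Add observed category indices (0 .. m-1) to the count vector."""
    new_counts = list(counts)
    for k in outcomes:
        new_counts[k] += 1
    return new_counts


def dirichlet_mean(counts):
    """Mean of each P_i for the Dirichlet density with parameters r_i + 1."""
    m = len(counts)
    n = sum(counts)
    return [(r + 1) / (n + m) for r in counts]


def dirichlet_variance(counts):
    """Marginal variance of each P_i for the same Dirichlet density."""
    total = sum(counts) + len(counts)
    return [(r + 1) * (total - r - 1) / (total ** 2 * (total + 1))
            for r in counts]


counts = update_counts([0, 0, 0], [0, 2, 2, 1])   # m = 3 categories
print(counts, dirichlet_mean(counts))
```

With m = 2 these expressions reduce to the beta mean and variance of the binomial case.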

The three cases discussed above--binomial, multinomial and binary
Markov--may be encountered in determining thresholds for likelihood
ratio  tests  in  pattern  recognition.  It  is  possible,  moreover,  to  utilize 
these  learning  techniques  to  obtain  the  thresholds.  This  may  result  in 
using  variable  thresholds.  This  possibility  is  discussed  in  more  detail 
in  the  next  chapter. 

The binary Markov process is an example of a case where a
reproducing-type density can be found without assuming that the
$\lambda_i$ are conditionally independent given $\theta$. This is the
case that was utilized to
introduce the method of generalizing to allow for dependent learning
observations in Chapter V, Section D. It is the only example included
herein in which the learning observations are not conditionally
independent given $\theta$. Other cases of this type can be treated in an analogous
manner,  although  most  of  them  will  be  more  complex. 
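
A minimal Python sketch of the binary Markov case follows. It assumes, as in the discussion above, that the first digit carries no information about the transition probabilities, so that the transitions out of state 0 and out of state 1 can be counted separately and each treated as a binomial learning problem with its own beta density. The function names and the sample sequence are illustrative assumptions.

```python
def transition_counts(digits):
    """counts[i][j] = number of times digit i is immediately followed by digit j."""
    counts = [[0, 0], [0, 0]]
    for current, following in zip(digits, digits[1:]):
        counts[current][following] += 1
    return counts


def mean_probability_of_one(counts):
    """Beta mean of P(next digit = 1 | current digit = i), for i = 0 and 1."""
    return [(row[1] + 1) / (row[0] + row[1] + 2) for row in counts]


sequence = [0, 1, 1, 0, 1, 1, 1, 0]        # hypothetical learning sequence
counts = transition_counts(sequence)
print(counts, mean_probability_of_one(counts))
```

The observations are dependent, yet the four transition counts still form a statistic of fixed dimension.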

The  densities  obtained  for  the  multivariate  Gaussian  process  with 
unknown  mean  vector  (Case  5)  and  with  unknown  covariance  matrix  (Case 
6),  and  for  the  case  with  both  mean  vector  and  covariance  matrix  unknown 
(which is included in the appendix), are the densities that Abramson,
Braverman  and  Keehn  have  shown  to  be  of  the  reproducing  type  as  discussed 
in  Chapter  III.  Similarly,  the  densities  given  for  the  binomial  and 
multinomial  cases  are  those  used  by  Bellman  and  Mosimann,  respectively, 
and  a  number  of  the  densities  have  been  used  by  Raiffa  and  Schlaifer. 

The only case mentioned in Chapter III for which it has been found that
reproducing-type densities have been used but in which the density used
is not the form in Table 2 is that discussed by Turin [Ref. 13]. The
density given in Table 2 for the unknown amplitude and phase of a
complex Gaussian mean is not the Rician density used by Turin, although it
is  similar.  The  difference  is  discussed  in  more  detail  in  later  sections 
of  this  chapter. 

The  density  given  in  Table  2  for  the  complex  Gaussian  case  (Case  7) 
is  not  as  complex  as  it  may  at  first  seem.  The  density  is  actually 
simple  save  for  the  normalizing  constant.  This  can  be  seen  by  rewriting 
the  density  in  either  of  the  forms 


p(a,0) 


r 


a  -  2a | X  [  cos (0 


or 


■ 


K2  exp 


{-in- 


a  >  0,  -«  <  0  <  n 


Lo, 


otherwise. 


(50) 


with $K_1$ and $K_2$ normalizing constants chosen so that either of the
forms of $p(a, \theta)$ in Eq. (50) integrates to one.

The final case on the list--rectangular distribution with unknown
mean--is a rather off-beat example. This density violates some of the
statistical criteria for "regularity," since it is not continuous. The
reproducing density obtained also has unusual properties. It is the
only case encountered in this study where the density is not defined
after one observation because $p(\lambda_1 \mid \theta)$ is not integrable. Some care
must be exercised in picking "a priori observations" also, since these
must be less than W in absolute magnitude. If this condition is not
fulfilled, the a priori p(W) will be zero at the true value of W
and the a posteriori density cannot degenerate at the correct point.
Picking observations less than W in absolute magnitude may be difficult
if nothing is known about W.

4.  Sufficient  Statistics 


Sufficient statistics for each of the various probability
distributions analyzed can easily be obtained from Table 2, since the
densities therein are expressed in terms of the sufficient statistics.

For  the  binomial  distribution  it  is  found  that  n  and  r  (or  r  and 
s)  constitute  a  sufficient  statistic.  Similarly,  for  the  multinomial 
distribution,  rn ,  ...  r^  are  sufficient;  for  the  binary  Markov,  r, 
n. 


1'  m 

^l*  A  00  OI,iAU,  iA  q,  for  the  Poisson,  t 
Gaussian  with  unknown  mean  vector 


11' 


and  n;  for  the  multidimensional 


and  n;  for  the  multidimensional 
n 


Gaussian  with  unknown  covariance  matrix  V  and  n;  for  the  complex 


Gaussian,  |: 
exponential, 


and  n;  for  the  Rayleigh,  K  and  n;  for  the 


n  '  w  '  n 

and  n;  and  for  the  rectangular  density, 


M  and  n. 
n 


5. Representation of a Priori Knowledge


When using simple reproducing densities, such as those in Table
2, the parameters of these densities can be adjusted to reflect a priori
knowledge. A priori observations are selected which, on the basis of
the a priori information available, appear representative of the
observations to be expected; these observations are then used to generate
the reproducing density.* For example, in Case 1, if the probability
of obtaining a one for a binomial distribution were expected to be about
1/2, a beta density for P with r and s approximately equal would
be chosen; or if the mean of a Gaussian distribution (Case 5) were
expected to be near zero, a priori observations with a sample average
near zero would be chosen. The degree of confidence in such a priori
knowledge is reflected in the size of the total set of a priori
observations, or by the magnitude of such parameters as n, r or t
in the densities in Table 2. If there is reason to be confident that
the a priori knowledge is approximately correct, the parameters
indicating the size of this a priori set would be large; if little
confidence is reposed in the a priori knowledge, the parameters
selected would be small.

* Normally only the sufficient statistics for these sets of a priori
observations would be selected, rather than the observations themselves.

In some cases, the a priori knowledge is not in the form of
sufficient statistics such as those in terms of which the densities in Table
2 are defined, but the a priori knowledge is better described as
consisting of approximately what the value of the unknown parameter is
expected to be, plus the approximate width of the expected a priori
density (or the amount of deviation from the expected value that might
reasonably be allowed for). In Table 3 are listed important moments,
i.e., means, variances, and covariances, for the reproducing-type
densities in Table 2. These moments can be utilized to fit a priori
knowledge having the forms designated.
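
For the binomial case (Case 1), fitting the moments can be sketched as follows. The sketch assumes the beta density of Eq. (45), whose mean is (r+1)/(n+2) and whose variance is (r+1)(n-r+1)/[(n+2)^2 (n+3)], and solves these two relations for n and r; the function name is hypothetical, and the resulting parameters need not be integers.

```python
def beta_stat_from_moments(mean, variance):
    """Return (n, r) such that the density P**r (1-P)**(n-r) has the given moments."""
    if not 0.0 < mean < 1.0:
        raise ValueError("mean must lie strictly between 0 and 1")
    if not 0.0 < variance < mean * (1.0 - mean):
        raise ValueError("variance must be positive and below mean*(1-mean)")
    # For a beta density, variance = mean*(1-mean) / (n + 3), so:
    n_plus_2 = mean * (1.0 - mean) / variance - 1.0
    n = n_plus_2 - 2.0
    r = mean * n_plus_2 - 1.0
    return n, r


# Expected value about 0.5 with a standard deviation of about 0.1.
print(beta_stat_from_moments(0.5, 0.1 ** 2))
```

A narrower expected density (smaller variance) yields a larger n, in line with the interpretation of n as a measure of confidence in the a priori knowledge.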

6.  Limiting  Forms  of  Densities 

The  moments  in  Table  3  are  also  useful  in  determining  limiting 
properties of the densities as the size of the set of a priori
observations (or of the combined set of a priori and a posteriori
observations) becomes very large or very small. Since the size of this set
indicates  the  degree  of  confidence  reposed  in  the  a  priori  knowledge 
(or  the  combined  a  priori  and  a  posteriori  knowledge),  the  limiting 
forms  would  be  expected  to  be  a  very  narrow  density  approximating  a  delta 
function  for  a  large  set  of  observations,  and  a  very  broad  density 
approximating  a  uniform  density  for  a  small  set  of  observations.  Tables 
4  and  5  indicate  that  these  are  indeed  the  limiting  forms  obtained. 

Table  4  indicates  the  limiting  forms  for  the  moments  obtained  with 
a  large  set  of  observations.  In  each  case  the  means  approach  limiting 
forms  that  are  possible  values  for  the  unknown  parameters,  while  the 
variances  and  covariances  approach  zero.  This  indicates  that  a  delta 
function  is  the  limiting  form  of  the  density  for  a  large  set  of 
observations. 
TABLE 3. IMPORTANT MOMENTS OF REPRODUCING DENSITIES.

TABLE 4. LARGE SAMPLE LIMITS OF MOMENTS.

TABLE 5. SMALL SAMPLE LIMITS OF MOMENTS.
Where a uniform density is not defined because of the range of $\theta$ having infinite Lebesgue measure, limiting values of moments for
densities approaching uniformity are given if the limit is independent of the limiting process.


Since a priori observations and a posteriori observations are
treated in identical manners, Table 4 can be used to find the limiting
forms of the a posteriori densities, assuming a finite a priori set
of observations and an increasingly large a posteriori set. The
limiting form is in each case a delta function as before, but the
location of the delta function can be stated precisely. In the Appendix it
is shown that, in each case, the mean converges with probability one to
the true value of the unknown parameter. Hence, the densities approach
delta functions at the true values of the unknown parameters, or the
learning system learns the true values exactly.
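
The large-sample behavior can be checked numerically for the binomial case. The following Python sketch is a simulation under stated assumptions: the true value of P, the deliberately misleading a priori statistic, the seed, and the function names are all hypothetical. It tracks the mean and variance of the beta density of Eq. (45) as a posteriori observations accumulate.

```python
import random


def posterior_moments(n, r):
    """Mean and variance of the beta density with exponents r and n - r."""
    mean = (r + 1) / (n + 2)
    variance = (r + 1) * (n - r + 1) / ((n + 2) ** 2 * (n + 3))
    return mean, variance


def learn(true_p, n0, r0, steps, seed=0):
    """Simulate binary observations and record the moments after each one."""
    rng = random.Random(seed)
    n, r = n0, r0
    history = []
    for _ in range(steps):
        r += 1 if rng.random() < true_p else 0
        n += 1
        history.append(posterior_moments(n, r))
    return history


# A priori statistic (n0, r0) = (10, 8) suggests P near 0.75; the true P is 0.3.
history = learn(true_p=0.3, n0=10, r0=8, steps=20000)
print(history[0], history[-1])
```

The finite a priori set is eventually outweighed by the a posteriori observations, so the mean moves toward the true value while the variance shrinks.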

In  Table  5  the  limiting  forms  of  the  moments  are  analyzed  as  the  size 
of  the  set  of  a  priori  observations  approaches  zero.  In  making  this 
analysis,  parameters  indicating  the  size  of  the  a  priori  set  have  not 
been confined to integer values, since the densities are defined
regardless of whether these parameters are integer valued or not. The
procedure used to find these limiting forms is simply to let all the
parameters defining the size of the set of a priori observations
approach  zero,  finding  the  limiting  forms  of  the  means,  variances,  and 
covariances  whenever  these  limiting  values  are  uniquely  defined. 
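
For the binomial case this limiting procedure can be sketched numerically in Python. The ratios r/n used below are hypothetical choices, made only to illustrate that the limit does not depend on how the parameters approach zero; the moment formulas are those of the beta density of Eq. (45).

```python
def moments(n, r):
    """Mean and variance of the beta density with exponents r and n - r."""
    mean = (r + 1.0) / (n + 2.0)
    variance = (r + 1.0) * (n - r + 1.0) / ((n + 2.0) ** 2 * (n + 3.0))
    return mean, variance


# Let the (non-integer) size n shrink while holding the ratio r/n fixed.
for fraction in (0.1, 0.5, 0.9):
    for n in (1.0, 0.1, 0.001):
        print(fraction, n, moments(n, fraction * n))
```

Whatever the ratio, the mean tends to 1/2 and the variance to 1/12, the moments of a uniform density on the interval from zero to one.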

In  Table  5  the  limiting  forms  obtained  for  the  means,  variances,  and 
covariances  are  compared  with  the  means,  variances,  and  covariances  of 
random  variables  distributed  according  to  a  uniform  density  over  the 
range  of  possible  values  of  the  unknown  parameter.  In  some  cases  a 
uniform  density  is  not  defined  over  this  range  because  the  range  is  of 
infinite  Lebesgue  measure.*  In  these  cases  the  moments  tabulated  are 
the  limiting  values  of  the  moments  of  a  sequence  of  random  variables 
with  probability  distributions  approaching  a  uniform  distribution,  if 
the  limiting  values  are  uniquely  defined;  if  the  limits  are  not 
uniquely defined, this is indicated in Table 5. In each case, exact
agreement is found between the moments of the reproducing-type
densities and the moments of uniform densities. If the moments of either
are  uniquely  defined,  the  moments  of  the  other  are  also  uniquely  defined 
and  take  the  same  values. 

* As noted earlier, uniform probability densities over sets of infinite
Lebesgue measure are allowed in the theory developed by Renyi [Ref. 21],
however.

Details of the computing methods for all the tables are given in
the Appendix.


C. SOME COMPOSITE REPRODUCING-TYPE DISTRIBUTIONS

As indicated in the previous section, simple reproducing-type
distributions  contain  enough  adjustable  parameters  to  give  considerable 
freedom  in  choosing  a  priori  probabilities.  A  number  of  types  of 
a  priori  knowledge  can  be  reflected  in  these  a  priori  distributions, 
including  values  of  the  parameters  that  are  felt  to  be  typical  and  a 
measure  of  the  confidence  reposed  in  the  a  priori  knowledge. 

Even  more  freedom  in  choosing  a  priori  distributions  is  available 
if  composite  reproducing-type  distributions  are  considered.  As  indicated 
in  Eq.  (30),  a  simple  reproducing-type  distribution  multiplied  by  an 
arbitrary (except for scale factor) non-negative function of $\theta$ is still
a  reproducing-type  distribution.  These  more  complex  reproducing-type 
distributions  have  been  defined  to  be  composite  reproducing-type 
distributions. 

In  this  section  no  attempt  is  made  to  indicate  all  the  possibilities 
of choosing composite reproducing distributions. The discussion is
limited  to  two  general  classes  of  composite  reproducing  distributions. 

1. Restricting the Range of $\theta$

One class of composite reproducing distributions is useful when
part of the a priori knowledge is the fact that the true value of $\theta$
is contained in some interval I. For example, it might be desired to
detect a signal of unknown frequency, using a receiver of a known finite
bandwidth. The probability of receiving a signal outside the frequency
band accepted by the receiver would be zero. In such a case $r(\theta)$ in
Eq. (30) may be taken as


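$$
r(\theta) = \begin{cases}
  1, & \theta \in I \\
  0, & \text{otherwise}
\end{cases}
\tag{51}
$$

giving

$$
p(\theta) = \begin{cases}
  \dfrac{p(\theta \mid \lambda_{-n_0+1}, \ldots, \lambda_0)}
        {\int_I p(\theta \mid \lambda_{-n_0+1}, \ldots, \lambda_0)\, d\theta},
    & \theta \in I \\
  0, & \text{otherwise.}
\end{cases}
\tag{52}
$$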


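For example, if $\theta$ were the unknown mean m of a one-dimensional
Gaussian distribution with known variance $\sigma^2$, and if it were known
that a < m < b, an a priori density on m might be obtained by
picking an a priori set $\{\lambda_{-n+1}, \ldots, \lambda_0\}$ of
learning observations
(all confined to the interval $a < \lambda_i < b$) and setting

$$
p(m) = \begin{cases}
  \dfrac{\dfrac{1}{\sqrt{2\pi}\,\sigma_n}
         \exp\left[-\dfrac{(m - \bar{\lambda}_n)^2}{2\sigma_n^2}\right]}
        {\Phi\left(\dfrac{b - \bar{\lambda}_n}{\sigma_n}\right)
         - \Phi\left(\dfrac{a - \bar{\lambda}_n}{\sigma_n}\right)},
    & a < m < b \\
  0, & \text{otherwise}
\end{cases}
\tag{53}
$$

where

$$\bar{\lambda}_n = \frac{1}{n} \sum_{i=-n+1}^{0} \lambda_i \tag{54}$$

$$\sigma_n^2 = \frac{1}{n}\, \sigma^2 \tag{55}$$

and $\Phi(x)$ is the Gaussian cumulative distribution function

$$\Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-x'^2/2}\, dx' \tag{56}$$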
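
The restriction of a Gaussian a priori density to an interval can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the unrestricted density is taken to be Gaussian, centered on the sample average of the a priori observations with variance equal to the known variance divided by their number; the function names, the observations, and the interval are hypothetical. The Gaussian cumulative distribution function is computed from the standard error function.

```python
from math import erf, exp, pi, sqrt


def gaussian_cdf(x):
    """Cumulative distribution function of the unit Gaussian."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


def restricted_gaussian_prior(m, observations, sigma, a, b):
    """Gaussian density on the mean m, restricted to a < m < b and renormalized."""
    if not a < m < b:
        return 0.0
    n = len(observations)
    center = sum(observations) / n          # sample average of the a priori set
    sigma_n = sigma / sqrt(n)               # standard deviation after n observations
    unrestricted = (exp(-0.5 * ((m - center) / sigma_n) ** 2)
                    / (sqrt(2.0 * pi) * sigma_n))
    mass_inside = (gaussian_cdf((b - center) / sigma_n)
                   - gaussian_cdf((a - center) / sigma_n))
    return unrestricted / mass_inside


def integrate(f, lo, hi, steps=20000):
    """Midpoint-rule integral, used only to check the normalization."""
    width = (hi - lo) / steps
    return width * sum(f(lo + (k + 0.5) * width) for k in range(steps))


observations = [0.2, -0.1, 0.4]             # hypothetical a priori observations
total = integrate(
    lambda m: restricted_gaussian_prior(m, observations, 1.0, -1.0, 1.0),
    -1.0, 1.0)
print(total)
```

The numerical integral over the interval comes out very close to one, which checks the normalization by the difference of the two cumulative distribution values.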

2.  Converting  Density  to  Familiar  Form 

Alternatively, it may be possible to choose $r(\theta)$ in such a
way as to convert a probability density into a more familiar form. For
example, if the problem consists of learning both the magnitude and the
phase of a complex Gaussian mean (Case 7), the simple reproducing
density is listed in Table 2 as:


r 

|—  1 2 

T-l 

‘ 4  n  * 

To 

4j2 

L  n  -J 

p(a,0)  = 


2o2  L 

n 


a2-2a[xj  cos  (0-tJ  +  \  jxj2 


a  >  0,  -rt  <  0  <  rt 
otherwise.  (57) 


If $r(a, \theta)$ is taken identically equal to a, then from Eq. (30)
(writing the normalizing constant given by the reciprocal of the
denominator in Eq. (30) along with the other constant factors involved
as a constant K):


p(a>0) 


a  >  0,  -rt  <  0  <  rt 
otherwise; 


cos  (0-8ri)+|Xri|2 


a  >  0,  -Jt<0<  rty 


otherwise.  (58) 


The  normalizing  constant  K  was  evaluated  in  the  second  expression  for 
p(a,0)  by  a  procedure  suggested  in  Section  A.  It  was  noted  that  the 
density  depended  on  its  arguments  in  the  same  manner  as  one  of  the 
standard  densities  used  in  statistical  communication  theory;  in  this 
case  the  dependence  is  the  same  as  that  of  the  generalized  Rayleigh  or 
Rician  density  encountered  in  the  study  of  narrow-band  signals  in 
Gaussian  noise.  Hence,  the  normalizing  constant  for  the  Rician  density 
was  used. 
3. Other Possibilities


Numerous other reasons for choosing a particular $r(\theta)$ may
occur. The form may be determined by reasoning about physical
principles, to agree with experimental results, or in numerous other ways.


4.  Computation  of  Density  Needed  in  Chapter  VII 

One  density  of  the  form  in  Eq.  (52)  will  be  needed  in  the  next 
chapter. Consider an event E with conditional probability


P(E |f )  =  Kx  exp 


T 

1 

2 

J  x(t  )ei^3tf"bdt 

0 

(59) 


with $K_1$ a normalizing constant. Assume that f is known to be
confined to the interval I for which $f_0 \le f \le f_1$. To obtain a
reproducing density, select a function y(t) which is defined for
$-T_0 < t < 0$ and let


p(f)  = 


F» 

0 

2 

(  K  exp 

q£_ 

N 

/  y(t)  ei2rtftdt 

< 

0 

-T  ^ 

■> 

0 

0, 


f  <  f  <  f, 
,  o  -  -  1 


otherwise. 


(6o) 


where $K_2$ is another normalizing constant. (The normalizing constants
are not evaluated in this example since they are complex and are
unnecessary for the later analysis.)

The  a  posteriori  density  after  observing  the  event  E  is  then 


p(f|E) 


f 

o 


<f  <fl 


otherwise. 


(61) 


where 


r y(t),  -t  <t<o 

*(t)  =J  0 

\^x(t ),  0  <  t  < 

since  x(t)  and  y(t)  are  defined  on  disjoint  time  intervals. 


(62) 
D. COMPARISON WITH RESULTS OBTAINED BY OTHER INVESTIGATORS


Reproducing-type distributions are used in a number of papers
surveyed in the literature. Results obtained in this investigation may
be briefly compared with those in a few of the papers in which
reproducing-type distributions are used.

1. Abramson, Braverman, Keehn, Bellman, and Mosimann

As already noted, the densities that Abramson, Braverman, Keehn,
Bellman and Mosimann [Refs. 7-12] used are the same densities as the
simple  reproducing-type  densities  developed  in  this  investigation  for 
the  cases  considered.  The  present  study  has  developed  methods  for 
generating  these  densities  rather  than  finding  them  by  an  heuristic, 
or  trial-and-error,  process. 

2.  Daly 

Daly's problem [Refs. 16 and 17] cannot be solved by the
methods developed in the present investigation, since for his densities
no sufficient statistics of fixed dimension exist, with the consequence
that no reproducing a priori density exists. In fact, the density
Eq. (11) that was given in the discussion of a simple case of Daly's
problem  is  a  special  case  of  the  density 

$$p(x \mid m_1, m_2, \sigma_1^2, \sigma_2^2, p) \tag{63}$$

Dynkin [Ref. 19] shows that for the density in Eq. (63) no
sufficient statistic of fixed dimension exists if any one of the parameters
$m_1$, $m_2$, $\sigma_1^2$, $\sigma_2^2$ or $p$ is unknown.

3.  Raiffa  and  Schlaifer 

Raiffa and Schlaifer [Ref. 15] utilize reproducing densities
in a large portion of their work on statistical decision theory. Their
"natural conjugate" a priori densities are the same form as the simple
reproducing-type densities in the present investigation. Raiffa and
Schlaifer do not utilize any specific set of a priori observations to
generate the reproducing density, however, merely saying that the
density is generated by the kernel of the sufficient statistic for the
likelihood [the function $f(t_1, \ldots, t_s, \theta)$ in Eq. (27)]. The
a priori observations have been utilized in the present work largely as an aid
to visualizing the process of generating reproducing-type distributions,
and of utilizing the distributions to reflect a priori knowledge.

For small samples at least, a difficulty with the Raiffa-Schlaifer
approach lies in ascertaining the number of observations to which the
a priori knowledge is equivalent--a problem discussed on pages 62-67
of the work cited [Ref. 15], and also discussed in earlier sections of
this report. An example of the difference in methods of interpretation
is the case of learning the probability P characterizing a binomial
distribution. Raiffa and Schlaifer consider the knowledge reflected in
the density Eq. (44) to be equivalent to $n_0 + 2$ observations, since
Eq. (44) is a valid probability density for $n_0 + 2 > 0$, while in this
paper the knowledge is considered to be equivalent to $n_0$ observations.
As Raiffa and Schlaifer's equivalent number of observations, $n_0 + 2$,
approaches zero, the a priori density degenerates into a probability
mass function with mass divided between zero and one, a fact that the
authors discuss at some length. No matter how many a posteriori
observations are then made, the density remains degenerate. In contrast, in
the present investigation as the equivalent number of observations $n_0$
approaches zero, Eq. (44) approaches a uniform density (see Table 5)--
a much more reasonable result.
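
The two limits can be compared numerically through the variance of the beta density. The Python sketch below rests on assumptions that should be read as such: the shifted convention is taken to count $n_0 + 2$ equivalent observations, as stated above, and (as a further assumption of this sketch) $r_0 + 1$ equivalent ones; the function names are hypothetical.

```python
def beta_moments(alpha, beta):
    """Mean and variance of a beta density with exponents alpha - 1 and beta - 1."""
    total = alpha + beta
    mean = alpha / total
    variance = alpha * beta / (total ** 2 * (total + 1.0))
    return mean, variance


def present_convention(n0, r0):
    """Density P**r0 (1-P)**(n0-r0), counted as n0 equivalent observations."""
    return beta_moments(r0 + 1.0, n0 - r0 + 1.0)


def shifted_convention(n_equiv, r_equiv):
    """Same family, counted as n_equiv = n0 + 2 equivalent observations."""
    return beta_moments(r_equiv, n_equiv - r_equiv)


for size in (1.0, 0.01, 0.0001):
    print(size,
          present_convention(size, size / 2.0),
          shifted_convention(size, size / 2.0))
```

As the count shrinks, the variance under the present convention tends to 1/12 (a uniform density), while under the shifted convention it tends to 1/4, the variance of a probability mass split equally between zero and one.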

Raiffa  and  Schlaifer  also  confine  their  work  entirely  to  simple 
reproducing  densities  ("natural  conjugate"  densities).  They  make  no 
mention  of  any  other  form  of  densities  which  may  reproduce. 

4.  Turin 

Turin [Ref. 13] utilizes a slight modification of the composite
reproducing density Eq. (58) for learning the characteristics of a radio
channel. He assumes that a known signal $Y = (y_1, \ldots, y_n)$ is
transmitted over a channel with amplification a and phase shift $\theta$, so
that the received signal is $X = a Y e^{j\theta}$. Assuming additive Gaussian
noise with mean zero and variance $\sigma^2$:


p(X  fa, 0,1 


n 

\ 

'  1  10 1 2  " 

\  exp 

L  ^ 

, 1  Va  H  e  1 

(64) 


This equation is the same as the basic equation developed in the
present study for the likelihood of a complex Gaussian process with
unknown mean (Case 7, Table 2), save for replacing the constant a by
the variable $a y_i$. Following the same procedure used in the present
paper in analyzing the complex Gaussian case, and assuming Y is known,
there is obtained for a simple reproducing density on $(a, \theta)$,


P(a,0) 


with 


T-l 

Jo 

Ucr2 

L  0  J 

exp 


0, 


a2- 2a  R 
r 


cos  (0-&n)  +  |  R2 


a  >  0,  -it  <  0  <  jt 
otherwise.  (65) 


(66a) 


(66b) 


(66c)

This density reduces to that shown in Table 2 if $y_i$ is taken equal
to one for all $i$.

On the basis of reasoning about the physical process he is considering,
Turin picks as a priori density on $(a, \theta)$ the Rician
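density

$$p(a, \theta) = \begin{cases} \dfrac{a}{2\pi\sigma^2} \exp\left\{ -\dfrac{1}{2\sigma^2} \left[ a^2 - 2aR\cos(\theta - \delta) + R^2 \right] \right\}, & a > 0,\ -\pi < \theta < \pi \\ 0, & \text{otherwise} \end{cases} \tag{67}$$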


which corresponds to Eq. (58) in the same way that Eq. (65) corresponds
to the density in Case 7, Table 2. Thus, Turin's density is a composite
reproducing density with $r(a, \theta)$ equal to $a$. The analysis developed
in the present study shows why Turin's density reproduces itself, and
also indicates how alternative reproducing densities which may agree
more closely with experiment may be found.

Reproducing distributions are doubtless used elsewhere in the
literature. The treatment described in the present paper is, however,
more general and thorough than any other found in the literature search.

VII. APPLICATIONS


A.  PATTERN  RECOGNITION,  EXPONENTIAL  DENSITIES 

In the previous work by Abramson, Braverman, and Keehn discussed in
Chapter III, reproducing distributions were applied to a pattern-
recognition process with learning. Using the methods developed in the
present study, it is easily possible to generate reproducing distributions
for learning a wide variety of parameters, thus obtaining obvious
generalizations of the Abramson, Braverman, and Keehn techniques. One
application similar to (but in some respects more complex than) the
applications discussed by Abramson, Braverman, and Keehn involves
learning the parameters of a non-Gaussian density, and in addition
learning the probability of a pattern and using this to adjust a
threshold.

Consider  a  variation  of  the  pattern-recognition  problem  discussed  in 
Chapter  III.  It  is  again  desired  to  find  a  decision  rule  minimizing  the 
probability  of  error  in  recognition.  Equation  (8)  and  the  discussion 
that  accompanies  it  indicate  that  the  optimum  decision  rule  picks  the 
pattern  for  which  p(x|i)P(i)  is  maximum. 

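For simplicity assume two possible patterns, designated by the
indices 1 and 2. The optimum decision rule (with $d(X) = 1$ denoting a
decision for pattern 1) is then:

$$d(X) = \begin{cases} 1, & \text{if } p(X|1)\,P(1) \ge p(X|2)\,P(2) \\ 0, & \text{otherwise} \end{cases} \tag{68}$$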


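If it be assumed that $p(X|i)$ is an exponential density with
parameter $\lambda_i$, Eq. (68) becomes

$$d(X) = \begin{cases} 1, & \text{if } e^{(\lambda_2 - \lambda_1) X} \ge \dfrac{\lambda_2}{\lambda_1} \dfrac{P(2)}{P(1)} \\ 0, & \text{otherwise} \end{cases} \tag{69}$$

or

$$d(X) = \begin{cases} 1, & \text{if } (\lambda_2 - \lambda_1) X \ge \ln \dfrac{\lambda_2}{\lambda_1} + \ln \dfrac{P(2)}{P(1)} \\ 0, & \text{otherwise} \end{cases} \tag{70}$$

When neither the $\lambda_i$ nor the $P(i)$ is known, the learning procedure
developed in this investigation is employed. To learn the $\lambda_i$, the
simple reproducing density for this case (No. 9 in Table 2) is used. As
an a priori density on $\lambda_i$ the gamma density given by

$$p(\lambda_i) = \frac{C_{0i}}{n_{0i}!} (C_{0i} \lambda_i)^{n_{0i}} e^{-C_{0i} \lambda_i} \tag{71}$$

is assumed. This gives

$$p(x|i) = \int_0^\infty p(x|i, \lambda_i)\, p(\lambda_i)\, d\lambda_i = \frac{n_{0i} + 1}{C_{0i}} \cdot \frac{1}{(1 + x/C_{0i})^{n_{0i} + 2}} \tag{72}$$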
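The closed form in Eq. (72) can be checked numerically. The following is a minimal sketch and not part of the original report: it assumes the exponential density $p(x|i, \lambda_i) = \lambda_i e^{-\lambda_i x}$ and an integer $n_{0i}$, and the function names, integration limit, and step count are illustrative choices.

```python
import math


def gamma_prior(lam, n0, c0):
    """A priori density of Eq. (71): (c0 / n0!) * (c0*lam)**n0 * exp(-c0*lam)."""
    return c0 / math.gamma(n0 + 1) * (c0 * lam) ** n0 * math.exp(-c0 * lam)


def predictive_closed_form(x, n0, c0):
    """Right-hand side of Eq. (72): ((n0+1)/c0) / (1 + x/c0)**(n0+2)."""
    return (n0 + 1) / c0 / (1.0 + x / c0) ** (n0 + 2)


def predictive_numeric(x, n0, c0, upper=40.0, steps=40_000):
    """Midpoint-rule integral of lam*exp(-lam*x) * p(lam) over (0, upper).

    The upper limit is an assumption; it must be large enough that the
    integrand is negligible beyond it.
    """
    h = upper / steps
    total = 0.0
    for k in range(steps):
        lam = (k + 0.5) * h
        total += lam * math.exp(-lam * x) * gamma_prior(lam, n0, c0)
    return total * h
```

For example, `predictive_closed_form(0.7, 3, 2.0)` and `predictive_numeric(0.7, 3, 2.0)` should agree to within the accuracy of the numerical integration.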


It is also desired to learn the probabilities $P(i)$. Letting $P(1)$
equal $P$ and $P(2)$ equal $1-P$, it is seen that $P$ is the parameter
characterizing a binomial distribution. Use is again made of a simple
reproducing density (in this case No. 1 in Table 2). The number of
times each pattern occurs in the "a priori set of observations" is
already known; the parameter $n_{0i}$ in Eqs. (71) and (72) corresponds
to the number of observations of pattern $i$. Substituting $n_{01}$ and
$n_{02}$ for the corresponding parameters $r$ and $s$ in Case 1 of Table 2:

$$p(P) = \frac{\Gamma(n_0 + 2)}{\Gamma(n_{01} + 1)\,\Gamma(n_{02} + 1)} P^{n_{01}} (1 - P)^{n_{02}} \tag{73}$$

where

$$n_0 \triangleq n_{01} + n_{02} \tag{74}$$

Then, applying the standard statistical procedure for computing
marginal probabilities

$$P(i) = \int_0^1 P(i|P)\, p(P)\, dP = \frac{n_{0i} + 1}{n_0 + 2}, \quad i = 1, 2 \tag{75}$$

since $P(1|P) = P$, $P(2|P) = 1-P$. The optimum decision rule then becomes

$$d(X) = \begin{cases} 1, & \text{if } \dfrac{(1 + X/C_{02})^{n_{02}+2}}{(1 + X/C_{01})^{n_{01}+2}} \ge \dfrac{(n_{02}+1)/(n_0+2)}{(n_{01}+1)/(n_0+2)} \cdot \dfrac{(n_{02}+1)/C_{02}}{(n_{01}+1)/C_{01}} \\ 0, & \text{otherwise} \end{cases} \tag{76}$$


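If classified learning observations are then taken, with $n_{li}$
from class $i$, an "a posteriori decision rule" of identical form
results except for replacing $n_{0i}$ by $n_{0i} + n_{li} = n_{ti}$, $n_0$ by
$n_0 + n_l = n_t$, and $C_{0i}$ by $C_{0i} + C_{li} = C_{ti}$ (with $C_{li}$ the sum of the
$X$ that correspond to the $i$th pattern). The optimum decision rule
after $n_l$ observations is:

$$d_{n_l}(X) = \begin{cases} 1, & \text{if } \dfrac{(1 + X/C_{t2})^{n_{t2}+2}}{(1 + X/C_{t1})^{n_{t1}+2}} \ge \dfrac{(n_{t2}+1)/(n_t+2)}{(n_{t1}+1)/(n_t+2)} \cdot \dfrac{(n_{t2}+1)/C_{t2}}{(n_{t1}+1)/C_{t1}} \\ 0, & \text{otherwise} \end{cases} \tag{77}$$

Since $(n_{ti}+1)/(n_t+2)$ is an estimate of $P(i)$, it is designated
by $\hat{P}(i)$. Similarly, $(n_{ti}+1)/C_{ti}$ is designated by $\hat{\lambda}_i$ since it is
an estimate of the parameter $\lambda_i$. Taking logarithms in Eq. (77):

$$d_{n_l}(X) = \begin{cases} 1, & \text{if } (n_{t2}+2) \ln\left(1 + \dfrac{X}{C_{t2}}\right) - (n_{t1}+2) \ln\left(1 + \dfrac{X}{C_{t1}}\right) \ge \ln \dfrac{\hat{\lambda}_2}{\hat{\lambda}_1} + \ln \dfrac{\hat{P}(2)}{\hat{P}(1)} \\ 0, & \text{otherwise} \end{cases} \tag{78}$$

The quantity $x/C_{ti}$ can normally be expected to be of the order
$1/n_{ti}$. Hence, after a few observations, the first term in the expansion
of the logarithm becomes predominant and higher-order terms can be
neglected. After a few observations it is also possible to neglect the
difference between $n_{ti}+2$ and $n_{ti}+1$. After a few observations, then,
the optimum decision rule given by Eq. (77) is closely approximated by
the decision rule

$$d_{n_l}(X) \approx \begin{cases} 1, & \text{if } (\hat{\lambda}_2 - \hat{\lambda}_1) X \ge \ln \dfrac{\hat{\lambda}_2}{\hat{\lambda}_1} + \ln \dfrac{\hat{P}(2)}{\hat{P}(1)} \\ 0, & \text{otherwise.} \end{cases} \tag{79}$$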


This is of the same form as Eq. (70). Hence, it may be concluded
that after a few observations are taken, the optimum decision rule is
closely approximated by a rule that is of the established form for
known statistics, but which utilizes estimates of the parameters in
place of the parameters themselves.
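The learning classifier described above can be sketched in a few lines. This is an illustrative sketch under stated assumptions rather than the report's implementation: the class and function names, the default "a priori observations", and the synthetic training values are all hypothetical, and the decision functions return the index of the chosen pattern (1 or 2). `decide_exact` uses the logarithm of Eq. (77); `decide_approx` uses Eq. (79).

```python
import math
import random


class ExponentialPatternClassifier:
    """Two-pattern classifier that learns lambda_i and P(i) from labeled samples."""

    def __init__(self, n0=(1, 1), c0=(1.0, 1.0)):
        # "A priori observations": counts n_0i and sums C_0i (illustrative defaults).
        self.n = list(n0)
        self.c = list(c0)

    def learn(self, x, pattern):
        """Add one classified observation x from pattern 1 or 2."""
        i = pattern - 1
        self.n[i] += 1   # n_ti = n_0i + n_li
        self.c[i] += x   # C_ti = C_0i + C_li

    def estimates(self):
        """Return ([P_hat(1), P_hat(2)], [lambda_hat_1, lambda_hat_2])."""
        n_t = self.n[0] + self.n[1]
        p_hat = [(n + 1) / (n_t + 2) for n in self.n]
        lam_hat = [(n + 1) / c for n, c in zip(self.n, self.c)]
        return p_hat, lam_hat

    def _threshold(self):
        # ln(lambda_hat_2 / lambda_hat_1) + ln(P_hat(2) / P_hat(1))
        p_hat, lam_hat = self.estimates()
        return math.log(lam_hat[1] / lam_hat[0]) + math.log(p_hat[1] / p_hat[0])

    def decide_exact(self, x):
        """Logarithmic form of Eq. (77); returns the chosen pattern, 1 or 2."""
        lhs = ((self.n[1] + 2) * math.log1p(x / self.c[1])
               - (self.n[0] + 2) * math.log1p(x / self.c[0]))
        return 1 if lhs >= self._threshold() else 2

    def decide_approx(self, x):
        """Approximate rule of Eq. (79); returns the chosen pattern, 1 or 2."""
        _, lam_hat = self.estimates()
        lhs = (lam_hat[1] - lam_hat[0]) * x
        return 1 if lhs >= self._threshold() else 2


def train_demo(num=5000, lam=(1.0, 4.0), p1=0.6, seed=0):
    """Train on synthetic labeled data (all values here are illustrative)."""
    rng = random.Random(seed)
    clf = ExponentialPatternClassifier()
    for _ in range(num):
        pattern = 1 if rng.random() < p1 else 2
        clf.learn(rng.expovariate(lam[pattern - 1]), pattern)
    return clf
```

After a few thousand synthetic observations the two rules would be expected to agree for observations well away from the threshold, which is the behavior the approximation argument predicts.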

The approximate decision rule derived in Eq. (79) can be implemented
as shown in Fig. 6 by a device of the form which would be applicable
with known parameters, but with variable components.

FIG. 6. PATTERN CLASSIFIER FOR EXPONENTIAL DENSITIES.

Since the $\hat{\lambda}_i$ may take on any positive values and the $\hat{P}(i)$ any
values between zero and one, the Bayes' decision rules computed from
Eq. (79) can assign all $X$'s below any real-number threshold to
class 1 and those above the threshold to class 2, or vice versa. In
other words, any non-randomized decision rule based on a single
threshold is a possible Bayes' rule.

The estimate of each of the parameters used in Eq. (79) converges
with probability one to the true value of the parameter. Hence, the
limiting form of the decision rule given in Eq. (79) is identical to
the rule that would be used if all the parameters were known. This
again could be any non-randomized decision rule based on a single
threshold.


B.  FINDING  EXPECTATION  OF  A  RANDOM  VARIABLE 

Another class of problems for which reproducing densities are
applicable is that of finding the expectation of a random variable.

More precisely, reproducing densities are useful in cases where a
probability density is required that will adequately represent a priori
information and at the same time allow the expected value of a
non-negative random variable to be expressed in a simple form. This type
of problem may be illustrated by considering the problem of detecting
a cosine of unknown frequency.* Two possible hypotheses are assumed:

$$\begin{aligned} H_1 &: X(t) = S(t) + N(t) \\ H_2 &: X(t) = N(t) \end{aligned} \tag{80}$$

where

$$S(t) = a \cos(\omega t + \theta), \quad \omega = 2\pi f \tag{81}$$

and $N(t)$ is white noise, or noise with a flat spectrum $S_N(f) = N_0/2$
(at least over the frequency range $f_0 < f < f_1$).

* This example was suggested and first worked out by Professor
Norman Abramson, Stanford University.

It is assumed that the parameters $a$, $\theta$, and $f$ (or $\omega$) are all
unknown, although the following are known: (1) that $\theta$ is uniformly
distributed over the range $0 < \theta < 2\pi$; (2) that $a$ is
Rayleigh-distributed with parameter $A$; and (3) that $f$ is restricted to the
frequency range $f_0 < f < f_1$. It is desired to use a likelihood ratio
test, comparing

$$\ell(x) = \frac{p(x|H_1)}{p(x|H_2)} \tag{82}$$

with some threshold.

If $a$, $\theta$, and $f$ were known, the likelihood of a sample $X(t)$,
$0 < t < T_1$, would be


In writing the last form of the equation it has been assumed that
$T_1$ is large in comparison with $1/f$, so that the integral of the
cosine-squared term is approximately $T_1/2$ regardless of $\omega$ or $\theta$.

It is shown in the Appendix that, with the likelihood given in
Eq. (83) and with the probability densities assumed for $\theta$ and $a$,

(84)


i(x|f)  =  Itj.  exp 


2 


with 


(85) 


It is desired to find a probability density $p(f)$ which will give
a reasonably simple form for $\ell(x)$ and at the same time accurately
reflect any information that is available about $f$. Such a density is
obtained by following the same process that was used in finding
reproducing densities. Although $\ell(x|f)$ is not a probability density, it is
non-negative. If $\ell(x|f)$ were normalized to integrate to one, it would
satisfy the formal requirements for a probability density. This is the
same procedure used to derive reproducing-type densities from likelihood
functions; this suggests deriving a density for $f$ in the same manner.
Such a density was derived in Chapter VI, Section C, and is given by
Eq. (60). Utilizing the density in Eq. (60) for $f$ gives


with 


f 


1 


i(x)  -  K 


f 

o 


T  <  t  <  0 

o  — 

°  <  t  <  t 


(86) 


(87) 


and $K$ a new constant that may be absorbed into the threshold for the
likelihood-ratio test.

Without specifying $X(t)$ and $Y(t)$ more definitely, the integrals
in Eq. (86) cannot be evaluated. However, the following points may be
noted. If $T_0$ is small, the frequency information in Eq. (86) is
primarily determined by $X(t)$; if $T_0$ is large, the information
is primarily determined by $Y(t)$. Hence, $T_0$ is a measure of confidence
in the a priori information. Also, by proper choice of $Y(t)$, $p(f)$
can be caused to peak around any desired frequency band. The density
given by Eq. (60) appears to be the only one yet found with these
properties, which are important for this application.

C.  ESTIMATING  A  PARAMETER  WITH  NO  A  PRIORI  INFORMATION 
1.  Bayes  Estimates 

In order to compute the Bayes estimate of a parameter it is
necessary to specify an a priori probability distribution for the
parameter. If no information about this distribution is available, and
if no reason is known for favoring some values of the parameter, a
uniform a priori probability distribution is the logical assumption. It
is only possible to assume a uniform distribution if the range of the
parameter is of finite Lebesgue measure, however.*

The techniques developed in this investigation can be used to
eliminate this difficulty. To illustrate the procedure, assume that it
is desired to estimate a parameter $\omega$, and that a squared-error loss
function is involved:

$$L(\omega, \hat{\omega}) = (\hat{\omega} - \omega)^2 \tag{88}$$

where $\hat{\omega}$ is the available estimate of $\omega$. It is well known [Ref. 20]
that the Bayes estimate for this case is the a posteriori expected
value of $\omega$, or

$$\hat{\omega}(x) = \int \omega\, p(\omega|x)\, d\omega \tag{89}$$

with $x$ the observation that is being utilized to estimate $\omega$.

The function $p(\omega|x)$ is an a posteriori density function, of the
form that has been studied in this investigation. If it is desired to
approximate the form that the Bayes estimate would take with a uniform
a priori density over $\omega$, a reproducing-type a priori density on $\omega$
can be assumed, then the size of the set of a priori observations can
be allowed to approach zero. It has been shown that the reproducing
density then approaches a uniform density. At the same time, however,
the a posteriori density $p(\omega|x)$ approaches the "experimental" density
$p^{*}(\omega|x)$ of Eq. (20), if this latter density is defined. (This may be
seen by examining the form of Eq. (31) as the size of the set of a priori
observations approaches zero, with $r(\theta)$ set equal to a constant.)

* As mentioned earlier, uniform densities over ranges of infinite
measure are allowed in the theory developed by Renyi [Ref. 21].

The following result is thus obtained: The limiting form of the
Bayes estimate of $\omega$ as the a priori density on $\omega$ approaches
uniformity is given by

$$\hat{\omega}'(x) = \int \omega\, p^{*}(\omega|x)\, d\omega \tag{90}$$

where $p^{*}(\omega|x)$ is an "experimental" probability density of the form
defined in Eq. (20).

If the estimate is based on a sequence of measurements $\{X_1, \ldots, X_n\}$,
the same result is obtained, but with $p^{*}(\omega|x)$ replaced by
$p^{*}(\omega|X_1, \ldots, X_n)$. The Bayes estimates are given by the mean values
listed in Table 3 for the cases studied in this investigation; no
distinction was made between a priori and a posteriori observations in
making up this table.
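The limiting argument can be illustrated with a small numerical sketch. It is not taken from the report and rests on stated assumptions: the observations follow an exponential density with rate $\lambda$, and the a priori density is proportional to $\lambda^{n_0} e^{-C_0 \lambda}$, so that the a posteriori mean is $(n_0 + n + 1)/(C_0 + C)$, with $C$ the sum of the $n$ observations. Letting $n_0$ and $C_0$ approach zero gives $(n+1)/C$, the mean of the normalized likelihood. The function names are hypothetical.

```python
def bayes_rate_estimate(xs, n0, c0):
    """Posterior mean of the rate for a prior proportional to lam**n0 * exp(-c0*lam)."""
    return (n0 + len(xs) + 1) / (c0 + sum(xs))


def experimental_rate_estimate(xs):
    """Mean of the normalized likelihood lam**n * exp(-lam * sum(xs))."""
    return (len(xs) + 1) / sum(xs)


def shrinking_prior_estimates(xs, scales=(1.0, 0.1, 0.01, 0.001)):
    """Bayes estimates as the a priori set (n0, c0) = (s, s) shrinks toward zero."""
    return [bayes_rate_estimate(xs, s, s) for s in scales]
```

Calling `shrinking_prior_estimates` on a short list of positive observations should produce values that move steadily toward `experimental_rate_estimate` for the same list.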

The  derivation  given  above  is  based  on  the  assumption  of  a  squared- 
error  loss  function.  Bayes  estimates  with  other  loss  functions,  if  they 
can  be  evaluated,  are  also  given  in  terms  of  a  posteriori  densities. 
Estimates  with  no  a  priori  knowledge  would  be  obtained  in  a  manner 
analogous  to  that  just  described. 

2. Maximum-Likelihood Estimates

Maximum-likelihood estimates are often used instead of Bayes
estimates if no a priori information is available. The techniques
discussed in this report can also be used to simplify the procedure for
obtaining maximum-likelihood estimates. These estimates correspond to
the mode of the likelihood function, or the value of $\omega$ for which the
likelihood function is maximum. This mode is also the mode of the
"experimental portion" of the a posteriori density, since this portion
is simply a normalized version of the likelihood function. If the
"experimental portion" of the density is of fixed form, the mode can
normally be expressed as a fixed function of the parameters characterizing
the density. Expressing the parameters characterizing the density in
terms of the sufficient statistics for the observations, and the mode
in terms of these parameters, a recursive method for computing the
maximum-likelihood estimates is obtained. The maximum-likelihood estimates
may in this manner be expressed as explicit functions of the observations.

The  two  methods  discussed  above  for  estimating  parameters  when  no 
a  priori  information  is  available  are  not  equivalent,  although  the 
difference  is  negligible  for  large  numbers  of  learning  observations. 

For example, in estimating the parameter $P$ of a binomial distribution,
the maximum-likelihood and Bayes estimates are $r/n$ and $(r+1)/(n+2)$
respectively, while in estimating the covariance matrix of a Gaussian
density, the corresponding estimates are $V_n/n$ and $V_n/(n+d+1)$.
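The binomial case lends itself to a short sketch of the recursive computation. The code below is illustrative only (the function names are hypothetical): it carries the sufficient statistic $(r, n)$ forward one observation at a time and evaluates both estimates from it, the mode $r/n$ and the mean $(r+1)/(n+2)$ of the normalized likelihood $P^r (1-P)^{n-r}$.

```python
def update_statistic(stat, success):
    """Fold one Bernoulli outcome into the sufficient statistic (r, n)."""
    r, n = stat
    return (r + (1 if success else 0), n + 1)


def binomial_ml(stat):
    """Maximum-likelihood estimate: mode of P**r * (1-P)**(n-r). Requires n > 0."""
    r, n = stat
    return r / n


def binomial_bayes(stat):
    """Bayes estimate with no a priori observations: mean of the same density."""
    r, n = stat
    return (r + 1) / (n + 2)
```

The two estimates differ by $|2r - n| / (n(n+2))$, which is at most $1/(n+2)$, so the gap shrinks as learning observations accumulate.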
VIII. SUMMARY AND CONCLUSIONS


A  model  has  been  developed  for  a  learning  technique  capable  of 
utilizing  and  evaluating  statistical  information  relating  to  a  physical 
system  or  process.  Characteristics  of  the  technique  are  as  follows: 

A.  BASIC  ASSUMPTIONS 

1. A body of statistics is available, or can be obtained, about the
system or process under study.

2. In these statistics there are one or more parameters, denoted by
$\theta$, whose values are unknown.

3. Each unknown parameter $\theta$ can be treated as a random variable
having a probability density $p(\theta)$ over the range of its possible
values. (The expedient of treating $\theta$ in this manner is typical
of the "Bayesian" approach to probability theory.)

4. A priori information is available to aid in choosing the
probability density $p(\theta)$. This a priori information can
involve information gained from a knowledge of the physical
principles involved in the process, information gained from
experience, or information gained in other ways.

5. It is possible to perform experiments on the system, yielding
sets of learning observations $A_1, \ldots, A_n$.

6. The likelihood of each set of learning observations $A_i$ is
known as a function of $\theta$, and is designated as $p(A_i|\theta)$.
(When viewed as a function of $\theta$ for fixed $A_i$, $p(A_i|\theta)$ is
called a likelihood function; when viewed as a function of $A_i$
for fixed $\theta$, $p(A_i|\theta)$ is called a conditional-probability-
density function.)

7. The learning observations $A_1, \ldots, A_n$ are used only to gain
knowledge about $\theta$, and do not influence the values of $\theta$.

8. A random variable $Z$ may be selected to represent some desired
criterion of system performance, such as the fraction of the
time the system makes an error, or some other error function.

9. The excellence of system performance may be judged by the
statistical expectation of $Z$, $E[Z]$, where

$$E[Z] = \int E[Z|\theta]\, p(\theta)\, d\theta \tag{1}$$

10. In the above equation, $E[Z|\theta]$ is the conditional expectation
of $Z$ given $\theta$, expressed as a function of $\theta$, and is
independent of $A_1, \ldots, A_n$. $E[Z|\theta]$ is assumed to be known
a priori.


B.  DEVELOPMENT  OF  BASIC  LEARNING  MODEL 
1. Apply "Bayes' rule" to obtain

$$p(\theta|A_1) = \frac{p(A_1|\theta)\, p(\theta)}{\int p(A_1|\theta)\, p(\theta)\, d\theta} \tag{2}$$

where

$p(\theta|A_1)$ = a posteriori probability density of $\theta$
= probability density of $\theta$ evaluated in the light of the set of
learning observations $A_1$

and also

$p(\theta)$ = a priori probability density of $\theta$,

$p(A_1|\theta)$ = likelihood of the learning observations $A_1$.

2. Then Eq. (1) becomes

$$E[Z|A_1] = \int E[Z|\theta]\, p(\theta|A_1)\, d\theta \tag{3}$$

where

$E[Z|A_1]$ = statistical expectation of $Z$, in the light of the
learning observations $A_1$

and

$E[Z|\theta]$ = conditional expectation of $Z$ given $\theta$, expressed
as a function of $\theta$.

3. An additional set of learning observations $A_2$ is obtained
and Bayes' rule, Eq. (2), is again applied to obtain:

$$p(\theta|A_1, A_2) = \frac{p(A_2|\theta, A_1)\, p(\theta|A_1)}{\int p(A_2|\theta, A_1)\, p(\theta|A_1)\, d\theta} \tag{4}$$

4. The process is repeated to yield, eventually,

$$p(\theta|A_1, \ldots, A_n) = \frac{p(A_n|\theta, A_1, \ldots, A_{n-1})\, p(\theta|A_1, \ldots, A_{n-1})}{\int p(A_n|\theta, A_1, \ldots, A_{n-1})\, p(\theta|A_1, \ldots, A_{n-1})\, d\theta} \tag{5}$$

where

$p(\theta|A_1, \ldots, A_n)$ = the a posteriori probability density of $\theta$
in the light of the first $n$ sets of learning observations;

$p(A_n|\theta, A_1, \ldots, A_{n-1})$ = the likelihood of the $n$th set of
observations given the first $n-1$ sets of observations.

5. If it be assumed that the sets of learning observations are
conditionally independent given $\theta$, Eq. (5) may be simplified to:

$$p(\theta|A_1, \ldots, A_n) = \frac{p(A_n|\theta)\, p(\theta|A_1, \ldots, A_{n-1})}{\int p(A_n|\theta)\, p(\theta|A_1, \ldots, A_{n-1})\, d\theta} \tag{6}$$

6. Using Eq. (6) above (or Eq. (5)), Eq. (3) is expanded to give:

$$E[Z|A_1, \ldots, A_n] = \int E[Z|\theta]\, p(\theta|A_1, \ldots, A_n)\, d\theta \tag{7}$$

7. Equations (6) and (7) above form the basis for the learning model
illustrated in Fig. 1 of the report.
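The recursion of Eqs. (6) and (7) can be sketched numerically by representing the density on a grid of $\theta$ values. This is a minimal illustration under stated assumptions, not the report's implementation: $\theta$ is taken to be the success probability of a Bernoulli observation, each set $A_k$ is a single 0/1 outcome, integrals are replaced by sums over grid midpoints, and all function names are hypothetical.

```python
def make_grid(num_points=2000):
    """Midpoints of num_points equal cells covering the interval (0, 1)."""
    return [(k + 0.5) / num_points for k in range(num_points)]


def bayes_step(grid, density, likelihood):
    """One application of Eq. (6): multiply by p(A_n | theta) and renormalize."""
    weighted = [likelihood(theta) * w for theta, w in zip(grid, density)]
    total = sum(weighted)
    return [w / total for w in weighted]


def expectation(grid, density, cond_expectation):
    """Eq. (7) as a sum: E[Z | A_1..A_n] with E[Z | theta] = cond_expectation(theta)."""
    return sum(cond_expectation(theta) * w for theta, w in zip(grid, density))


def bernoulli_likelihood(outcome):
    """p(A | theta) for a single 0/1 outcome (the assumed observation model)."""
    if outcome:
        return lambda theta: theta
    return lambda theta: 1.0 - theta


def learn(grid, prior, outcomes):
    """Apply Eq. (6) once per observation, starting from the a priori masses."""
    density = prior
    for outcome in outcomes:
        density = bayes_step(grid, density, bernoulli_likelihood(outcome))
    return density
```

With a uniform a priori density and the outcomes 1, 1, 0, 1, the a posteriori mean computed this way should be close to $(3+1)/(4+2)$, the value given by the closed-form beta density.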
C. CONDITIONS FOR FEASIBILITY OF THE LEARNING PROCESS


1. The learning technique described above may be considered to be a
practical learning process if:

a. The true values of the unknown parameters are eventually
learned, at least in the limit as the number of learning
observations approaches infinity. This condition may be
considered to be met if, as the number of learning observations
approaches infinity, the a posteriori density $p(\theta|A_1, \ldots, A_n)$
approaches a Dirac delta function at the true values of the
unknown parameters.

b. The form of the learning process does not change as additional
observations are taken. This condition may be considered to
be met if the probability distributions on $\theta$ are reproducing
in nature, i.e., if the a posteriori and a priori distributions
are of the same form under Bayes' rule. If the distributions
are reproducing, the learning process simply involves
computation of new parameters for the densities at each stage
of the process, neither the number nor the type of computations
changing.

2. Condition (a) is fulfilled if it is possible to compute the true
value of $\theta$ from an infinite sequence of learning observations,
and this true value is not ruled out by $p(\theta)$, the a priori
probability distribution assumed for $\theta$. It is shown in the
report that these conditions are met by most probability distributions
of practical significance, even by some distributions of
such form that condition (b) cannot be met. Thus, the learning
process developed in this report should be valid for most
practical cases, provided condition (b) is also fulfilled.

3. In order to determine whether the a priori $p(\theta)$ assumed is
reproducing or not [condition (b)], a technique has been developed
whereby the expression for the a posteriori density is factorized
as follows:

$$p(\theta|A_1, \ldots, A_n) = \frac{p^{*}(\theta|A_1, \ldots, A_n)\, p(\theta)}{E[p(\theta)|A_1, \ldots, A_n]} \tag{19}$$

wherein

$p^{*}(\theta|A_1, \ldots, A_n)$ = "experimental portion" of a posteriori
density (depends only on the observations),

$E[p(\theta)|A_1, \ldots, A_n]$ = statistical expectation of $p(\theta)$ taken
with respect to $p^{*}(\theta|A_1, \ldots, A_n)$.

The likelihood function $p(A_1, \ldots, A_n|\theta)$ used to generate
$p^{*}(\theta|A_1, \ldots, A_n)$ is assumed to be an integrable, non-negative
function of $\theta$; the "experimental portion" of the a posteriori
density is a normalized version of the likelihood.

4. It is shown in the report that (at least after a large number of
learning observations) the behavior of the a posteriori density
$p(\theta|A_1, \ldots, A_n)$ is primarily determined by the "experimental
portion" $p^{*}(\theta|A_1, \ldots, A_n)$; see Eq. (19) above. Conditions for
the "experimental portion" to be reproducing are analyzed in the
report. It is shown that the "experimental portion" of the
a posteriori density is reproducing if and only if the learning
observations are such that a sufficient statistic for $\theta$ of fixed
dimension exists.

5. It is possible to find an a priori $p(\theta)$ that is reproducing if
and only if the "experimental portion" of the a posteriori density
is reproducing, i.e., if and only if a sufficient statistic for $\theta$
of fixed dimension exists. Any reproducing $p(\theta)$ that exists may
be generated by multiplying a function of the form of the likelihood
$p(A_1, \ldots, A_n|\theta)$ by an arbitrary non-negative function of $\theta$ and
then normalizing.

6. If a sufficient statistic for $\theta$ of fixed dimension exists, the
a posteriori densities $p(\theta|A_1)$, $p(\theta|A_1, A_2)$, ... become
reproducing after the first observation has been utilized (occasionally
after the first few observations have been utilized). Hence, if
there is no objection to one reprogramming of the learning system
after the first set of learning observations, the learning
techniques described herein can be applied regardless of what a priori
density $p(\theta)$ is used, provided a sufficient statistic of fixed
dimension exists.

7. Since this is the case, the use of reproducing-type a priori
densities may in many cases afford little if any simplification
in the computations involved. Non-reproducing densities might be
preferred if they resulted in a faster rate of convergence to a
delta function of the a posteriori probability densities. It
is shown, however, that little if any increase in rate of
convergence can be obtained by using non-reproducing densities, if the
a priori densities are approximately the same width.

8. The results can be generalized to apply to the case where the
learning observations $A_1, \ldots, A_n$ are not conditionally independent
given $\theta$; however, in this case the form of the learning system may
depend on the state of the system derived from the previous
observations.
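The role of a sufficient statistic of fixed dimension can be illustrated with a Gaussian density of unknown mean and known standard deviation, a case chosen here for illustration. In the sketch below (names hypothetical, not from the report) the pair (number of observations, sum of observations) is updated one observation at a time, and the normalized likelihood of the mean keeps the same Gaussian form throughout; a brute-force normalization on a grid is included as a check.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class NormalMeanStatistic:
    """Sufficient statistic of fixed dimension: (count, sum of observations)."""
    n: int = 0
    total: float = 0.0

    def update(self, x):
        """Fold in one observation; the dimension of the statistic never grows."""
        return NormalMeanStatistic(self.n + 1, self.total + x)


def experimental_density(stat, sigma, mu):
    """Normalized likelihood of mu: Gaussian, mean total/n, variance sigma**2/n."""
    mean = stat.total / stat.n
    var = sigma ** 2 / stat.n
    return math.exp(-(mu - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def normalized_likelihood_on_grid(xs, sigma, grid):
    """Brute-force check: normalize the product of Gaussian likelihoods over mu."""
    step = grid[1] - grid[0]
    raw = [math.exp(-sum((x - mu) ** 2 for x in xs) / (2 * sigma ** 2))
           for mu in grid]
    total = sum(raw) * step
    return [r / total for r in raw]
```

Because the density at every stage is described by the same two numbers, the learning computation has the same form and cost after one observation as after a thousand.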


D. EXAMPLES OF REPRODUCING-TYPE DENSITIES


1. Two classes of reproducing-type densities are considered:

a. Simple reproducing-type densities are densities identical in form
with the "experimental portion" of the a posteriori density.
Such densities may be generated by picking the "a priori
observations" $A_{-m}, \ldots, A_0$, then normalizing the likelihood
for these observations as in Eq. (20) above.

b. Composite reproducing-type densities are simple reproducing-
type densities multiplied by another function of $\theta$ and then
normalized; i.e., composite reproducing-type densities are
of the form

$$p(\theta) = \frac{p^{*}(\theta|A_{-m}, \ldots, A_0)\, r(\theta)}{\int p^{*}(\theta|A_{-m}, \ldots, A_0)\, r(\theta)\, d\theta} \tag{30}$$

where $r(\theta)$ is a non-negative, integrable function of $\theta$ and
$p^{*}(\theta|A_{-m}, \ldots, A_0)$ is the simple reproducing-type density.

2. Tables 1 through 5 list a number of simple reproducing-type
probability densities, with many of their parameters and properties.
Methods of utilizing the densities to represent a priori
knowledge are discussed; the limiting forms of the densities as
the number of observations becomes very small or very large are
also given.

3. Two important classes of composite reproducing-type densities are
discussed. The first class is applicable when the parameter $\theta$
is known to lie within a certain range, but no parts of this range
are to be preferred over others. The second class arises from
the possibility of choosing $r(\theta)$ to convert an unfamiliar
probability density into a more familiar form. Numerous other
types of composite reproducing-type densities are possible.
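The first class of composite densities mentioned above can be sketched numerically. The example is an illustration under stated assumptions, not the report's own computation: the parameter is a binomial probability $P$ known to lie in an interval, $r(\theta)$ is the indicator of that interval, the simple density is proportional to $P^r (1-P)^s$, and integrals are replaced by sums on a grid. Function names are hypothetical.

```python
def make_grid(num_points=1000):
    """Midpoints of num_points equal cells covering the interval (0, 1)."""
    return [(k + 0.5) / num_points for k in range(num_points)]


def composite_density(grid, r, s, lo, hi):
    """Eq. (30) on a grid: P**r * (1-P)**s times the indicator of [lo, hi]."""
    weights = [(p ** r) * ((1.0 - p) ** s) if lo <= p <= hi else 0.0
               for p in grid]
    total = sum(weights)
    return [w / total for w in weights]


def bayes_update(grid, prior, successes, failures):
    """Bayes' rule for binomial observations applied to grid masses."""
    weights = [w * (p ** successes) * ((1.0 - p) ** failures)
               for p, w in zip(grid, prior)]
    total = sum(weights)
    return [w / total for w in weights]
```

Updating `composite_density(grid, 2, 1, 0.2, 0.7)` with 3 successes and 2 failures should reproduce `composite_density(grid, 5, 3, 0.2, 0.7)`: only the two parameters change, while the range restriction carried by $r(\theta)$ is preserved.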


E.  APPLICATIONS 

1. As long as a sufficient statistic of fixed dimension exists, the
techniques herein developed are applicable to a wide variety of
problems such as pattern recognition with incomplete knowledge of
the statistics involved, finding a probability density that
simplifies taking the expectation of a non-negative random variable, or
estimating a parameter when no a priori information is available.
The problems include some for which the learning model developed
in this paper is not applicable.

2. The chief requirement for application of the technique is the
existence of a sufficient statistic of fixed dimension. Dynkin
[Ref. 19] has made a general study of the conditions under which
sufficient statistics of fixed dimension exist, and of methods for
finding them. Sufficient statistics of fixed dimension appear to
exist for most of the simpler probability laws normally encountered,
and for some of the more complex ones.

IX. RECOMMENDATIONS FOR FURTHER WORK


Although the results of this investigation give solutions to a
number of problems in the field of machine learning, they open up a
number of new problems. These problems include finding methods for
extending the present theory and finding methods for tying the present
theory in with other results in the machine-learning area. Some of
these problems are indicated below.

A. PROBLEMS SUGGESTED

1. Procedure When Sufficient Statistics do not Exist

Much of the work on the theory of communication systems involves
analyzing complex systems. The probability laws encountered in studying
the more complex systems (and some of the simpler ones) are often of
forms for which no simple, sufficient statistics exist. In these cases
the theory developed in this paper is not directly applicable.

One of the chief problems to be investigated is finding how to
proceed when no simple, sufficient statistic exists. A possible approach
would be to use a statistic that is not sufficient, but that is of
fixed dimension and in some sense "efficient." If this approach is to be
used, some method of comparing possible statistics is needed. A
criterion might be based on Kullback's information integral or divergence
[Refs. 22, 23], which are maximum if and only if based on a sufficient
statistic.

2.  Effect  of  Taking  Expectation  of  Performance  Criterion 

The analysis herein has been confined almost exclusively to the
computation of the probability densities $p(\Theta \mid \Lambda_1, \ldots, \Lambda_n)$. In actual
applications, these probability densities would normally be used to
take the expectation of some random variable (see the final stage in
Fig. 1). The forms that this final stage of the computation might take
and the effects of these forms on the learning process should be
investigated. The chief result along these lines in this investigation
is the proof that the limiting form for the total computation is the
form that would have been obtained if the unknown parameters had been
known (Corollary, Theorem I, Chapter IV).

3.  Rate  of  Convergence 

Little  work  has  been  done  on  investigating  the  rate  at  which  the 
probability  densities  converge  to  their  limiting  (delta  function)  form. 
Since  it  has  been  shown  that  the  convergence  properties  are  determined 
largely by the "experimental portion" of the a posteriori density,
and  since  this  portion  of  the  density  is  a  normalized  likelihood,  some 
of  the  techniques  employed  in  the  study  of  maximum  likelihood  estimates 
may  be  useful  here. 
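
As a rough illustration of the convergence in question, the sketch below tabulates the variance of the beta density obtained by normalizing the binomial likelihood, Beta(r+1, n-r+1), as n grows with the observed fraction r/n held fixed. The function name and the choice r/n = 0.3 are assumptions made for this example only; the roughly 1/n decay it displays is the familiar large-sample behavior of a normalized likelihood, not a rate established in this paper.

```python
def beta_variance(r, n):
    """Variance of the beta density proportional to P**r * (1 - P)**(n - r)."""
    a, b = r + 1, n - r + 1              # standard beta parameters
    return a * b / ((a + b) ** 2 * (a + b + 1))

# Hold the observed fraction r/n at 0.3 and let n grow.
sizes = [10, 100, 1000, 10000]
variances = [beta_variance(3 * n // 10, n) for n in sizes]
for n, v in zip(sizes, variances):
    # n * variance settles near 0.3 * 0.7 = 0.21 as n increases
    print(n, v, n * v)
```

The product n times the variance approaching a constant is what one would expect if techniques from maximum-likelihood theory carry over to this setting.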

4.  Applications 

The  material  presented  in  this  paper  has  only  begun  to  scratch 
the  surface  of  the  possible  applications  of  the  techniques  that  have 
been  developed.  The  problem  has  been  formulated  in  a  general  enough 
manner to indicate that there is a wide variety of possible applications;
however, a great deal of work on specific applications remains to be
done. 

5. Information-Theory Properties

The probability densities examined in this paper appear to have
some interesting information-theory properties. These aspects have not
been  investigated  as  yet.  It  may  be  possible  to  tie  the  theory 
developed  in  this  paper  in  with  some  models  for  learning  processes  that 
are  based  on  such  information-theory  concepts  as  entropy  [Refs.  24,  25]. 

6.  Effects  of  Errors 

If an error is made in the type of likelihood function $p(\Lambda_k \mid \Theta)$
assumed, the results are unpredictable. (This does not contradict the
proof that the limiting form of the a posteriori density is independent
of the a priori density, as in that case $p(\Lambda_k \mid \Theta)$ was not in
error.) The form that the a posteriori density will take in the
limit can be predicted in any particular case. For example, if it were
assumed that the observations were generated by a one-dimensional
Gaussian process with the density having known variance and unknown
mean, whereas the input observations were actually generated by an
exponential process, the sample average would be used as an estimate
of the mean while this sample average was actually converging to
$1/\lambda_0$ (see Tables 2 and 4). How accurately the resulting probability
distribution would fit the data is not clear. This question would be
worth investigating, as would a more general analysis of the effects
of errors.
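
The mismatch described above can be simulated directly. The following is a minimal sketch, assuming an exponential density with rate 2.0 as the true source (the rate, the seed, and the function name are illustrative choices, not values from the text). It shows only that the sample average, which the mistaken Gaussian model would adopt as its mean estimate, settles at the reciprocal of the exponential parameter.

```python
import random

def sample_average(lam0, n, seed=0):
    """Average of n draws from an exponential density with rate lam0."""
    rng = random.Random(seed)
    return sum(rng.expovariate(lam0) for _ in range(n)) / n

lam0 = 2.0                       # assumed true exponential parameter
avg = sample_average(lam0, 200_000)
print(avg, 1.0 / lam0)           # the two values should be close
```

The sketch says nothing about how well the resulting Gaussian density fits the data, which is the open question raised in the text.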

7.  Several  Possible  Likelihood  Functions 

In certain cases it might be known that the likelihood function
took one of several possible forms, such as Gaussian, Rayleigh, or
exponential, but the precise one of these forms applicable might not be
known. In such cases it may be feasible to assume a number of possible
forms for the likelihood function, weight each of these hypotheses by a
factor similar to Watanabe's credibility measure [Ref. 26], and adjust
the weights as observations are taken. A similar problem has been
investigated by Magill [Ref. 27] in developing techniques to predict
which of a known set of possible Gaussian signals is being observed,
and at the same time predict the value of the signal.
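
One way to picture such a weighting scheme is sketched below, under strong simplifying assumptions: each candidate form has fully known parameters (chosen arbitrarily here), the weights are ordinary a posteriori probabilities updated by Bayes' rule, and the data are simulated Rayleigh draws. This is plain Bayes-rule weighting, not Watanabe's credibility measure or Magill's technique, and all function names are hypothetical.

```python
import math
import random

def gaussian_pdf(x, mean=1.0, sigma=1.0):
    return math.exp(-0.5 * ((x - mean) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def rayleigh_pdf(x, sigma=1.0):
    return (x / sigma**2) * math.exp(-x * x / (2 * sigma**2)) if x > 0 else 0.0

def exponential_pdf(x, lam=1.0):
    return lam * math.exp(-lam * x) if x >= 0 else 0.0

def update_weights(weights, x, densities):
    """One Bayes-rule step: scale each weight by its likelihood, then renormalize."""
    posterior = [w * p(x) for w, p in zip(weights, densities)]
    total = sum(posterior)
    return [w / total for w in posterior]

densities = [gaussian_pdf, rayleigh_pdf, exponential_pdf]
weights = [1 / 3, 1 / 3, 1 / 3]          # equal initial weights (an assumption)
rng = random.Random(1)
for _ in range(1000):
    # Rayleigh draw with sigma = 1 by inverting the distribution function
    x = math.sqrt(-2.0 * math.log(1.0 - rng.random()))
    weights = update_weights(weights, x, densities)
print(weights)                            # weight on the Rayleigh hypothesis dominates
```

In a fuller treatment each hypothesis would also carry its own unknown parameters, learned alongside the weights.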


B.  SUMMARY 

In summary, a fairly general theory has been developed, which appears
to have wide applicability; however, much additional work on extending
the theory, tying it in with other theories, and applying it to specific
cases remains to be done.

APPENDIX

DETAILED  COMPUTING  PROCEDURES 


This  appendix  describes  the  detailed  procedures  used  in  computing 
the densities, limits, and so on in Tables 1 through 5; it also includes
a special computation for the expectation of a cosine of unknown
frequency  for  Chapter  VII. 

A.  COMPUTATION  OF  REPRODUCING  DENSITIES 

It is desired to compute the forms of the simple reproducing densities
listed in Table 2, plus the simple reproducing density for the
Gaussian case with both M and K unknown.

The  first  density,  the  beta  density  for  learning  P  for  a  binomial 
distribution,  was  computed  in  the  main  text.  The  computation  simply 
involves  normalizing  the  likelihood  function  in  the  first  column  of 
Table  2.  This  can  be  done  either  by  integration  or  by  comparing  with 
standard  densities  as  discussed  in  the  text.  A  similar  procedure  is 
followed  in  all  the  cases  in  Table  2. 
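
The "comparing with standard densities" route can be checked numerically. In the sketch below the normalizing constant is taken from the standard beta density with parameters r+1 and n-r+1, and a simple midpoint rule confirms that the result integrates to one. The values r = 3, n = 10 and the helper names are assumptions for illustration.

```python
import math

def beta_density(P, r, n):
    """Normalized likelihood P**r * (1 - P)**(n - r), a Beta(r+1, n-r+1) density."""
    norm = math.gamma(n + 2) / (math.gamma(r + 1) * math.gamma(n - r + 1))
    return norm * P**r * (1 - P) ** (n - r)

def integrate(f, a, b, steps=20000):
    """Midpoint rule, adequate for the smooth integrand used here."""
    h = (b - a) / steps
    return h * sum(f(a + (k + 0.5) * h) for k in range(steps))

area = integrate(lambda P: beta_density(P, r=3, n=10), 0.0, 1.0)
print(area)   # should be very close to 1
```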

The second and third densities in Table 2 are generalizations of
the first and need no discussion. The derivation of the fourth, a gamma
density for learning the parameter a for a Poisson distribution, is
also straightforward. It is simplified slightly if the likelihood is
rewritten as

$$p(n,\tau \mid a) = K(n,\tau)\, a^{n} e^{-a\tau} \tag{A.1}$$

and only the part depending on a is considered in normalizing.
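
As a worked instance of this normalization, writing the observation interval as $\tau$ and assuming the density is normalized over $0 < a < \infty$ with no further weighting, the part of the likelihood depending on a integrates to a gamma function, which gives a gamma density with shape $n+1$ and rate $\tau$:

```latex
\int_0^\infty a^{n} e^{-a\tau}\, da = \frac{\Gamma(n+1)}{\tau^{n+1}} = \frac{n!}{\tau^{n+1}},
\qquad\text{so}\qquad
p(a) = \frac{\tau^{n+1}}{n!}\, a^{n} e^{-a\tau}, \quad a > 0 .
```

The factor depending only on n and $\tau$ in the likelihood cancels in the normalization, which is why it can be ignored.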

The fifth density, Gaussian for learning a Gaussian mean, is derived
in a similar manner. The computation is simplified by completing the
square in the exponent of the likelihood, using

$$\sum_{i=1}^{n} (X_i - M)^t K^{-1} (X_i - M) = n(\bar{X}_n - M)^t K^{-1} (\bar{X}_n - M) + \text{terms independent of } M \tag{A.2}$$

The likelihood is then rewritten as

$$p(X_1, \ldots, X_n \mid M) = K(X_1, \ldots, X_n) \exp\left[-\frac{n}{2}(\bar{X}_n - M)^t K^{-1} (\bar{X}_n - M)\right] \tag{A.3}$$

proceeding thereafter as in the Poisson case.
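
The completing-the-square step can be verified numerically in the scalar case. The sketch below assumes one-dimensional observations and unit variance (K = 1), with arbitrary simulated data. It splits the sum of squared deviations about a trial mean m into the term that depends on m and a remainder that does not, the latter being absorbed into the factor multiplying the exponential.

```python
import random

def split_sum_of_squares(xs, m):
    """Return (direct, mean_term, remainder) for the scalar case with K = 1."""
    n = len(xs)
    xbar = sum(xs) / n
    direct = sum((x - m) ** 2 for x in xs)
    mean_term = n * (xbar - m) ** 2                 # the only part depending on m
    remainder = sum((x - xbar) ** 2 for x in xs)    # free of m
    return direct, mean_term, remainder

rng = random.Random(0)
xs = [rng.gauss(2.0, 1.5) for _ in range(50)]
direct, mean_term, remainder = split_sum_of_squares(xs, m=0.7)
print(direct, mean_term + remainder)   # the two numbers agree
```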

The Wishart density for learning an unknown covariance matrix
(Case 6) is derived in a similar manner, utilizing the identity

$$\operatorname{tr} V_n K^{-1} = \sum_{i=1}^{n} (X_i - M)^t K^{-1} (X_i - M) \tag{A.4}$$

to show that the two forms of the likelihood in the fifth and sixth
cases of Table 2 are equivalent. In this case, comparing the manner in
which the likelihood depends on $K^{-1}$ with the manner in which the
Wishart density depends on its argument is much simpler than integration
as a method of obtaining the normalizing constant. See Chapter VI, Section
A for a discussion of this procedure.

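If both M and $K^{-1}$ are unknown, $p(X_1, \ldots, X_n \mid M, K^{-1})$
is rewritten as

$$p(X_1, \ldots, X_n \mid M, K^{-1}) = \left[(2\pi)^d |K|\right]^{-(n-1)/2} \exp\left[-\tfrac{1}{2} \operatorname{tr} V_n K^{-1}\right] \cdot \left[(2\pi)^d |K|\right]^{-1/2} \exp\left[-\frac{n}{2}(\bar{X}_n - M)^t K^{-1} (\bar{X}_n - M)\right] \tag{A.5}$$

with

$$V_n = \sum_{i=1}^{n} (X_i - \bar{X}_n)(X_i - \bar{X}_n)^t, \tag{A.6}$$

and the other terms defined as before.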

The second factor in Eq. (A.5) depends on its parameter in the
manner in which a Gaussian density depends on its argument, while the
first factor depends on its parameter in the manner of a Wishart density.
This suggests as a normalized density*

$$p(M, K^{-1}) = \frac{|V_n|^{(n+d)/2}\, |K|^{-(n-1)/2} \exp\left[-\tfrac{1}{2}\operatorname{tr} V_n K^{-1}\right]}{2^{d(n+d)/2}\, \pi^{d(d-1)/4} \prod_{a=0}^{d-1} \Gamma\!\left(\frac{n+d-a}{2}\right)} \cdot \left[(2\pi)^d |K_n|\right]^{-1/2} \exp\left[-\tfrac{1}{2}(M - \bar{X}_n)^t K_n^{-1} (M - \bar{X}_n)\right] \tag{A.7}$$

The normalization in Eq. (A.7) can be checked by integrating first
over M, then over $K^{-1}$. The first integration gives a Wishart
density as a marginal density; the integral of this Wishart density is
then unity, as it should be.

$V_n$ is the only parameter that has been encountered in a simple
reproducing-type density for which a recurrence relation for computing
the new value of the parameter from its old value and the learning
observations is not obvious. A simple recurrence relation exists,
however, as follows:

$$V_n = V_{n-1} + \frac{n-1}{n}(X_n - \bar{X}_{n-1})(X_n - \bar{X}_{n-1})^t \tag{A.8}$$
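
The recurrence can be checked against the direct definition of the scatter matrix. The sketch below assumes the update adds (n-1)/n times the outer product of the deviation of the new observation from the previous sample mean, and that the sample mean is updated alongside it. It uses plain lists for vectors and matrices; the helper names and the simulated three-dimensional data are illustrative only.

```python
import random

def outer(u, v):
    return [[ui * vj for vj in v] for ui in u]

def scatter_direct(xs):
    """V_n = sum_i (x_i - xbar_n)(x_i - xbar_n)^t computed from scratch."""
    n, d = len(xs), len(xs[0])
    xbar = [sum(x[j] for x in xs) / n for j in range(d)]
    V = [[0.0] * d for _ in range(d)]
    for x in xs:
        dev = [x[j] - xbar[j] for j in range(d)]
        o = outer(dev, dev)
        for a in range(d):
            for b in range(d):
                V[a][b] += o[a][b]
    return V

def scatter_recursive(xs):
    """Same matrix via V_n = V_{n-1} + ((n-1)/n) e e^t, e = x_n - xbar_{n-1}."""
    d = len(xs[0])
    xbar = list(xs[0])                      # xbar_1 = x_1 and V_1 = 0
    V = [[0.0] * d for _ in range(d)]
    for n in range(2, len(xs) + 1):
        x = xs[n - 1]
        e = [x[j] - xbar[j] for j in range(d)]
        o = outer(e, e)
        for a in range(d):
            for b in range(d):
                V[a][b] += (n - 1) / n * o[a][b]
        xbar = [xbar[j] + e[j] / n for j in range(d)]   # running-mean update
    return V

rng = random.Random(0)
data = [[rng.gauss(0, 1), rng.gauss(1, 2), rng.gauss(-1, 0.5)] for _ in range(40)]
V_direct = scatter_direct(data)
V_rec = scatter_recursive(data)
max_diff = max(abs(V_direct[a][b] - V_rec[a][b]) for a in range(3) for b in range(3))
print(max_diff)   # agreement to rounding error
```

The recursive form needs only the previous matrix, the previous sample mean, and the new observation, which is what makes it suitable for sequential learning.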


To derive the density for learning the magnitude and phase of a
complex Gaussian mean (Case 7), the portion of the exponent in the
likelihood depending on a and $\theta$ is first rewritten as follows:

$$-2a\sum_{i=1}^{n}|X_i|\cos(\theta - \angle X_i) + \sum_{i=1}^{n} a^2 = -2na|\bar{X}_n|\cos(\theta - \delta_n) + na^2 \tag{A.9}$$

with $|\bar{X}_n|$ and $\delta_n$ defined in Table 2. The normalization is then
accomplished by computing

$$\int_{-\pi}^{\pi}\int_{0}^{\infty} \exp\left[-\frac{a^2 - 2a|\bar{X}_n|\cos(\theta - \delta_n)}{2\sigma_n^2}\right] da\, d\theta \tag{A.10}$$

utilizing some of the properties of Bessel functions [Ref. 28].

* The density in Eq. (A.7) is not included in Table 2. It is the
simple reproducing-type density for learning both M and $K^{-1}$
and is the density utilized by Keehn for this purpose [Ref. 10].

The  final  three  cases  in  Table  2  are  straightforward.  The  densities 
for  the  exponential  and  Rayleigh  cases  may  be  normalized  by  comparing 
with  the  gamma  density;  but  the  density  for  the  rectangular  distribution 
must  be  normalized  by  integration. 


B.  COMPUTATION  OF  MOMENTS 

The moments given in Table 3 were arrived at as follows: beta,
Dirichlet, gamma, Gaussian, and Wishart densities are standard forms
with moments already tabulated [Refs. 29 - 3^]. Hence, in this appendix
it is merely necessary to compute the means and the variances for the
two cases (Cases 7 and 10) where the simple reproducing-type densities
are not standard forms.

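The expectation of a, the magnitude of the complex mean of a
Gaussian density (Case 7), is given by

$$E[a] = \frac{\int_{-\pi}^{\pi}\int_{0}^{\infty} a \exp\left[-\frac{a^2 - 2a|\bar{X}_n|\cos(\theta - \delta_n)}{2\sigma_n^2}\right] da\, d\theta}{\int_{-\pi}^{\pi}\int_{0}^{\infty} \exp\left[-\frac{a^2 - 2a|\bar{X}_n|\cos(\theta - \delta_n)}{2\sigma_n^2}\right] da\, d\theta} \tag{A.11}$$

It is known that the integral of the Rician density is unity, or

$$\frac{\exp\left[-|\bar{X}_n|^2 / 2\sigma_n^2\right]}{2\pi\sigma_n^2} \int_{-\pi}^{\pi}\int_{0}^{\infty} a \exp\left[-\frac{a^2 - 2a|\bar{X}_n|\cos(\theta - \delta_n)}{2\sigma_n^2}\right] da\, d\theta = 1 \tag{A.12}$$

Comparing Eqs. (A.12) and (A.11), with the denominator of Eq. (A.11)
being the normalizing integral of Eq. (A.10), it is found that

$$E[a] = \left(\frac{2}{\pi}\right)^{1/2} \sigma_n \exp\left[\frac{|\bar{X}_n|^2}{4\sigma_n^2}\right] I_0^{-1}\left[\frac{|\bar{X}_n|^2}{4\sigma_n^2}\right] \tag{A.13}$$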
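
The closed form can be compared with direct numerical integration. The sketch below reads Eq. (A.13) as E[a] = (2/pi)^(1/2) sigma exp(u) / I_0(u) with u = |X|^2 / (4 sigma^2), and assumes the density on (a, theta) is proportional to exp[-(a^2 - 2a|X| cos(theta - delta)) / (2 sigma^2)], so that integrating out the phase leaves exp(-a^2 / 2 sigma^2) I_0(a|X| / sigma^2). The Bessel function is evaluated from its power series; all names and the test values |X| = 1.5, sigma = 1 are illustrative.

```python
import math

def bessel_i0(x):
    """Modified Bessel function I_0 from its power series (fine for moderate x)."""
    term, total, k = 1.0, 1.0, 0
    while term > 1e-17 * total:
        k += 1
        term *= (x / 2.0) ** 2 / (k * k)
        total += term
    return total

def mean_magnitude_numeric(xbar_abs, sigma, steps=20000):
    """E[a] by midpoint integration of the density with the phase integrated out."""
    upper = xbar_abs + 10.0 * sigma          # the tail beyond this is negligible
    h = upper / steps
    num = den = 0.0
    for k in range(steps):
        a = (k + 0.5) * h
        w = math.exp(-a * a / (2 * sigma**2)) * bessel_i0(a * xbar_abs / sigma**2)
        num += a * w
        den += w
    return num / den

def mean_magnitude_closed(xbar_abs, sigma):
    """Closed form as read from Eq. (A.13)."""
    u = xbar_abs**2 / (4 * sigma**2)
    return math.sqrt(2 / math.pi) * sigma * math.exp(u) / bessel_i0(u)

numeric = mean_magnitude_numeric(1.5, 1.0)
closed = mean_magnitude_closed(1.5, 1.0)
print(numeric, closed)   # the two values agree closely
```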


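To obtain the variance, the same procedure is followed, using the
fact [Ref. 32] that the first moment of the Rician density is given by

$$\frac{1}{2\pi\sigma_n^2}\int_{-\pi}^{\pi}\int_{0}^{\infty} a^2 \exp\left[-\frac{a^2 - 2a|\bar{X}_n|\cos(\theta - \delta_n) + |\bar{X}_n|^2}{2\sigma_n^2}\right] da\, d\theta = \left(\frac{\pi}{2}\right)^{1/2} \sigma_n \exp\left[-\frac{|\bar{X}_n|^2}{4\sigma_n^2}\right] \left\{\left(1 + \frac{|\bar{X}_n|^2}{2\sigma_n^2}\right) I_0\!\left(\frac{|\bar{X}_n|^2}{4\sigma_n^2}\right) + \frac{|\bar{X}_n|^2}{2\sigma_n^2}\, I_1\!\left(\frac{|\bar{X}_n|^2}{4\sigma_n^2}\right)\right\} \tag{A.14}$$

to obtain

$$E[a^2] = \frac{|\bar{X}_n|^2}{2}\left[1 + \frac{I_1\!\left(|\bar{X}_n|^2/4\sigma_n^2\right)}{I_0\!\left(|\bar{X}_n|^2/4\sigma_n^2\right)}\right] + \sigma_n^2 \tag{A.15}$$

Subtracting $E^2[a]$ gives the tabulated variance.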

Integrating the expression for $p(a,\theta)$ over a gives

$$p(\theta) = \begin{cases} \dfrac{\exp\left\{-\dfrac{|\bar{X}_n|^2}{4\sigma_n^2}\left[1 - 2\cos^2(\theta - \delta_n)\right]\right\}}{2\pi I_0\!\left(|\bar{X}_n|^2/4\sigma_n^2\right)} \left[1 + \operatorname{erf}\!\left(\dfrac{|\bar{X}_n|\cos(\theta - \delta_n)}{\sqrt{2}\,\sigma_n}\right)\right], & -\pi < \theta < \pi \\ 0, & \text{otherwise} \end{cases} \tag{A.16}$$

with erf(x) the error function. No closed-form expressions exist for
the moments of this density, so efforts are confined to finding large-
and small-sample equations. First, however, the mean and variance must
be computed for the final simple reproducing-type density, the density
for learning W for a rectangular distribution (Case 10).

E[W] is found by straightforward integration:

$$E[W] = \int_{M_n}^{\infty} (n-1)\left(\frac{M_n}{W}\right)^{n-1} dW \quad (n > 1) \;=\; \begin{cases} \dfrac{n-1}{n-2}\, M_n, & n > 2 \\ \infty, & 1 < n \le 2 \end{cases} \tag{A.17}$$

and

$$E[W^2] = \int_{M_n}^{\infty} (n-1)\, M_n \left(\frac{M_n}{W}\right)^{n-2} dW \quad (n > 1) \;=\; \begin{cases} \dfrac{n-1}{n-3}\, M_n^2, & n > 3 \\ \infty, & 1 < n \le 3 \end{cases} \tag{A.18}$$

Subtracting $E^2[W]$ gives Var[W] except for the case $1 < n \le 2$,
which is of the form $\infty - \infty$ and hence undefined.

C. LARGE-SAMPLE LIMITS OF MOMENTS

The limiting forms of many of the parameters in Table 4 may be obtained
by the simple algebraic process of letting the size of the set
of observations grow without bound, then computing the limits obtained.
This process gives all of the values tabulated as zero in Table 4.

The limiting forms of most of the remainder of the parameters follow
directly from application of the strong law of large numbers if the limits
are determined by actual observations. For the binomial distribution,
Case 1:

$$E\left[\frac{r}{n} \,\Big|\, P = P_0\right] = P_0 \tag{A.19}$$

Hence, by the strong law of large numbers

$$\frac{r}{n} \to P_0 \tag{A.20}$$

with probability one.

Similar reasoning applies in most of the other cases studied. In
the case of the multinomial distribution, Case 2:

(A.21)

For the binary Markov process, Case 3:

(A.22)

For the Poisson process, Case 4:

$$E\left[\frac{n}{\tau} \,\Big|\, a = a_0\right] = a_0 \tag{A.23}$$

For the Gaussian process with unknown mean vector, Case 5:

$$E\left[(\bar{X}_n)_i \mid m_i = m_{i0}\right] = m_{i0} \tag{A.24}$$

or with unknown covariance matrix, Case 6:

(A.25)

For the complex Gaussian process, Case 7:

$$E\left[|\bar{X}_n| \mid a = a_0\right] = a_0 \tag{A.26}$$

and

$$E\left[\delta_n \mid \theta = \theta_0\right] = \theta_0 \tag{A.27}$$

For the Rayleigh process, Case 8:

$$E\left[\frac{\sum X_i^2}{2n} \,\Big|\, \sigma^2 = \sigma_0^2\right] = \sigma_0^2 \tag{A.28}$$

hence, the reciprocal parameter p converges to $1/\sigma_0^2$. For the
exponential process, Case 9:

$$E\left[\frac{\sum X_i}{n} \,\Big|\, \lambda = \lambda_0\right] = \frac{1}{\lambda_0} \tag{A.29}$$

with the same type of reciprocal relationship as found in the corresponding
case, Case 9, in Table 3.

In each of these cases, the strong law of large numbers applies in
the  same  manner  as  in  the  binomial  case.  The  only  case  differing  is 
Case  10,  the  rectangular  distribution.  Convergence  can  be  proved  in 
this  case  also,  but  the  proof  differs  from  that  for  the  other  cases. 

Since in Case 10 the sequence of $M_n$'s is bounded and monotone,
it must have a limit, with probability one. This limit must be $W_0$
if the latter is the true value of W, since if the limit were not
$W_0$ it would have to be less than $W_0$. Then the Borel-Cantelli lemmas
[Ref. IS] would state that values between the limit and $W_0$ occurred
infinitely often in an infinite sequence of observations, a contradiction.
Hence, $M_n$ must converge to $W_0$ with probability one.
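
The behavior argued for above can be observed in simulation. The sketch below assumes observations uniform on (0, W_0) with W_0 = 3, takes M_n as the largest observation so far, and evaluates the mean and variance of the density (n-1) M_n^(n-1) / W^n on W >= M_n. That density form is an assumption inferred from the integrals of Section B, and the function name and numerical values are illustrative.

```python
import random

def rectangular_posterior_moments(m_n, n):
    """Mean and variance of the density (n-1) * m_n**(n-1) / W**n on W >= m_n."""
    mean = (n - 1) / (n - 2) * m_n if n > 2 else float("inf")
    second = (n - 1) / (n - 3) * m_n**2 if n > 3 else float("inf")
    # For n <= 2 both moments are infinite, so the variance is undefined.
    var = second - mean**2 if n > 2 else float("nan")
    return mean, var

rng = random.Random(0)
w0 = 3.0                                  # assumed true width
running_max, maxima = 0.0, []
for n in range(1, 5001):
    running_max = max(running_max, rng.uniform(0.0, w0))
    maxima.append(running_max)            # bounded, monotone sequence M_n

mean, var = rectangular_posterior_moments(maxima[-1], len(maxima))
print(maxima[9], maxima[99], maxima[-1], mean, var)
```

The running maximum never decreases and never exceeds W_0, so it can only approach W_0 from below, in line with the argument in the text.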

The limiting forms for means and variances in all cases save the
complex Gaussian, Case 7, then follow immediately from Table 3. For
the complex Gaussian density the limiting forms of the moments for a
follow from expansion of the Bessel function terms, using the usual
asymptotic expansions valid for large arguments [Ref. 28]. The moments
of $\theta$ follow from the limiting form of $p(\theta)$:

$$p(\theta) \approx \frac{|\bar{X}_n|}{\sqrt{2\pi}\,\sigma_n} \exp\left[-\frac{|\bar{X}_n|^2 \sin^2(\theta - \delta_n)}{2\sigma_n^2}\right], \quad -\pi < \theta < \pi \tag{A.30}$$

Since $\sigma_n^2 \to 0$, Expression (A.30) approaches zero except for $\theta \approx \delta_n$.
Hence, $E[\theta] \to \delta_n$. The order of magnitude of the variance can be
estimated from the width of the pulse given by Expression (A.30). This is
obviously of the order of $\sigma_n^2$. The variance is a measure of the width
of the pulse and must be of the same order of magnitude. The limiting
form of the covariance is at most of the maximum order of the variances.

D. SMALL-SAMPLE LIMITS OF MOMENTS

The values of all limits in Table 5, save for the complex Gaussian
case, phase variations, are obtained immediately from taking limits in
Table 3. The moments for uniform densities may be found tabulated in
the cases where the parameter range is finite. If the parameter range
is infinite, and a function of $\Theta$ is unbounded and non-negative, the
limiting value of the expectation of the function, as the density on
$\Theta$ approaches uniformity, is infinite; while if the function can be
both positive and negative, the limiting expectation is undefined.

This gives all values in Table 5 save for the moments of $\theta$ in the
seventh case.

For these moments of $\theta$ it is merely necessary to evaluate the
expression for $p(\theta)$ in Eq. (A.16) as n approaches zero and $\sigma_n^2$
approaches infinity. The limit is a uniform density over the range
$-\pi < \theta < \pi$.


E.  LIKELIHOOD  FOR  COSINE  OF  UNKNOWN  FREQUENCY 


Section  B  of  Chapter  VII  applied  the  learning  technique  to  finding 
the  expectation  of  a  random  variable--specifically  a  likelihood  ratio 
involving  a  cosine  of  unknown  frequency.  It  was  necessary  to  integrate 
Eq. (83) twice to obtain Eq. (84). Since $p(\theta)$ is uniform over the
range $[0, 2\pi]$:


i(x|a,f) 


X(t)  cos  (cut  +  Q)  dt 




Expanding the cosine term,

$$\int_0^T X(t)\cos(\omega t + \theta)\, dt = \cos\theta \int_0^T X(t)\cos\omega t\, dt - \sin\theta \int_0^T X(t)\sin\omega t\, dt \tag{A.32}$$
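
The expansion can be checked numerically, and the check also shows why only the magnitude of the complex correlation survives once the phase is averaged out. The sketch below assumes an integration interval [0, T] with T = 1, a midpoint-rule approximation, and an arbitrary test waveform; none of these choices come from the text, and the function names are hypothetical.

```python
import cmath
import math

def correlate(x, omega, theta, T=1.0, steps=4000):
    """Midpoint approximation of the integral of X(t) cos(omega t + theta) on [0, T]."""
    h = T / steps
    return h * sum(x((k + 0.5) * h) * math.cos(omega * (k + 0.5) * h + theta)
                   for k in range(steps))

def quadrature_components(x, omega, T=1.0, steps=4000):
    """Cosine and sine correlations: real and imaginary parts of int X(t) e^{i omega t} dt."""
    h = T / steps
    z = h * sum(x((k + 0.5) * h) * cmath.exp(1j * omega * (k + 0.5) * h)
                for k in range(steps))
    return z.real, z.imag

def signal(t):
    return 0.8 * math.cos(7.0 * t + 0.4) + 0.3 * t    # arbitrary test waveform

omega, theta = 7.0, 1.1
lhs = correlate(signal, omega, theta)
c, s = quadrature_components(signal, omega)
rhs = math.cos(theta) * c - math.sin(theta) * s
print(lhs, rhs, math.hypot(c, s))   # lhs equals rhs; hypot(c, s) bounds |lhs| for every theta
```

Because the left side equals the magnitude of the complex correlation times the cosine of a shifted phase, averaging over a uniform phase leaves a function of that magnitude alone.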


Hence 


i(x|a,f)  =  e  Ir 


- 

T 

1 

4a 

i r 

/  x(t)  eia3fcdt 

0 

u 

0 

_ 

(A. 33) 


Then, since by hypothesis, a is Rayleigh-distributed with
parameter A^:


f°  -a2/2N  B2 

i(x|f)=  “p  e  °  In 

-=t 

lev 

cdl 

1 _ 

T! 

/  X(t)  eiUJtdt 

r 

aj 

oc 

> 

( 

LNoB  1 

0 

J 

(A.34)


wherein use is made of the fact that the integral of the Rician density
is unity, in a manner analogous to Section B of this Appendix.

Defense  Communicati  >:.s  Agency 
Dept,  of  Defense 
Wasiiingtor  ,  D,  C, 

L  Attn:  Code  L'LA,  Tech.  Library 

Advisory  Group  on  Electron  Devices 
j4t>  Broadway,  Jth  Floor  East 
Hew  York  13*  N.  Y. 

J  Attn:  II.  Gull  Ivan 

Advisory  Group  on  Reliability  of 
Electronic  Equipment 
Office  Asst,  Cecy.  of  Defense 
‘]  r.e  ieidagor 
J  Wn:  id’  .-on  .  D.C. 

Cotranandi'g  Officer 
Diamond  Ord:  at  re  Fuse  Libs 
Vta..'hingtot  P‘<,  D.  C. 

Att-  :  ORDTL  930,  Dr.  K.  T.  Young 

Di  *  itflrj  j  id  O  rd  n  a  i  ce  Fu so  Lab . 

U.3.  Orb  :/*e  Corps 
Warhingto*  D..  C, 

1  Attn:  ORDTL-  vjOmCyH, 

Mr.  R,  H.  Corny t. 

II. 3.  Dept,  of  Commerce 
national  Bure-u  of  Standard:: 
Boulder  Labs 

Centra.  Radio  Propagation  Lab. 

•t  It.  •!  ier,  Co -or ado 
’  Attn:  Miss  J.  V.  Lincoln,  Cnief 
RWT'.r. 

DJF,  E;  gineering  Section 
1  Warning  o’’,  D.C. 

Infomatior  Retrieval  Section 
Federal  Avi*  U  Agency 
Wasnlngtou,  n,  C. 
l  Att;  :  MS- 11.  ,  juibrur.y  Branch 

i)DC 

Camera-  St  at  lot. 

Alcoa; dr ia  4,  Vn. 

C  A  tin:  TISIA 

U.G.  Coast  Guard 
1500  E.  Street,  L.W. 

Washington  D.  C. 

1  Att;  :  EEE  Stutio:  ‘>-1 

Office  of  Technical  Services 
Dept,  of  Commerce 
1  Wa.  •  I  ng'.or  L.C. 

Director 

Ualio  .al  Security  Agency 
Fort  George  G.  Meade,  Mi. 
j.  Atti  ;  R4,. 


NASA,  Godin rd  Space  Flignt  Cer:ter 
Greej.belt,  Mi, 

i  Att  :  Code  nil,  Dr.  G.  H.  Ludwig 
1  Chief,  Data  Systems  DlvirLoin 

Chief,  (J.fl.  Amy  Security  Agency 
Arlington  iiali  Station 
Ari  Ing1  o.n  1- ,  Virginia 


* 

No  AF  or  Cia:  c  i fled  Reports. 


GCHOOIC 


*U  of  Aberdeen 
Dept,  of  natural  Philosophy 
Marlschul  College 
Aberdeen,  Scotian t 
1  Attn:  Mr.  R.  V.  Jones 

U  of  Arizona 
EE  Dept. 

Tucson,  Aria. 

1  Attn:  R.  L.  Walker 
1  D,  J,  Hamilton 

*U  of  British  Columbia 
Vancouver  6,  Canada 
1  Attn:  Dr,  A.  C.  Soudack 

California  Institute  of  Technology 
Pasadena,  Calif, 

1  Attn:  Prof.  R.  W.  Gould 
1  Prof.  L.  M.  Field,  EE  Dept. 

1  D.  Bravorman,  EE  Dept. 

California  Institute  of  Technology 
4600  Oak  Grove  Drive 
Pasadena  3,  Calif. 

1  Attn:  Library,  Jet  Propulsion  Lab. 

U.  of  California 
Berkeley  4,  Calif. 

1  Attn:  Prof.  R.  M.  Saunders,  EE  Dept. 
1  Dr.  R.  K.  Wake r ling, 

Radiation  Lab.  Info.  Div., 
Bldg.  ;;0,  Rm.  101 

U  of  California 
Los  Angeles  .’4,  Calif. 

1  Attn:  C.  T.  Leondeo,  Prof,  of 
Engineering,  Er.gineeri;.g 
Department 

1  R.  S.  Elliott, 

Electromagnetics  Div.,,  College 
of  Engineering 

U  of  California,  San  Diego 
School  of  Science  and  Engineering 
La  Jolla,  Calif. 

1  Attn:  Physics  Dept, 

Carr.egie  Institute  of  Technology 
Schetiley  Park 
Pittsburg  13,  Pa. 

1  Attn:  Dr.  E.  M.  Williams,  EF.  Dept. 

Case  Institute  of  Technology 
Engineering  Design  Center 
Cleveland  u,  Ohio 

1  Attn:  Dr.  J.  B.  Neswick,  Director 
Cornell  U 

Cognitive  Systems  Research  Program 
Ithaca,  Ti.  Y. 

1  Attn:  F.  Rosenblatt,  Hollister  Hall 


Drexel  I; ctitule  of  Technology 
Philadelphia  4,  pa. 

1  Attn:  F.  E.  Haynes,  EE  Dept. 

U  of  Florida 

Engineering  Bldg.,  Rn,  33k 
Geinsville,  Fla. 

1  Attn:  M.  J.  Wiggins,  EE  Dept. 

Georgia  It stitute  of  Technology 
Atlanta  1_\,  Ga. 

1  Attn:  Mrs.  J.  H,  Crosland 
Librarian 

1  F.  Dixor.,  Er.gr.  Experiment 

Static*, 


Harvard  U 
Pierce  Hall 
Cambridge  Jo,  Mann. 

1  Attn:  Dean  H.  Brooks,  Div  of  Engr. 

and  Applied  Physics,  Rm.  t‘l'( 
J  E.  Farkr.s,  Librarian,  Rm. 

303A,  To cii.  Reports 
Collection 

U  of  HnwalL 
Honolulu  14,  Hawaii 
1  Attn:  Arot.  Prof.  K.  Itajitn, 

EE  Dept. 

Illinois  Institute  of  Ter! .  of » *,v 
Technology  Cor. ter 
Chicago  i6,  Ill. 

1  Attn:  Dr.  P.  C.  Yuen,  EE  Dept, 

U  of  Illinois 
Urbane.,  Ill. 

1  Attn:  P.  D.  CoLerar.!,  EE  Lab, 

1  W.  Perkin.’,  EE  lies.  Lab. 

1  A.  Albert,  Tech.  Ed,,  EE 

Res.  Lab. 

1  Library  Serials  Dept. 

1  Prof.  D.  Alpert,.  Coordinated 

3c i.  Lab. 


*Ii.stituto  it*  Pesquisas  l"  Mari-  ha 
Minis tcrlo  da  Marinh 
Rio  de  Janeiro 
Eetado  da  Guai.obaru,  Brasil 
1  Attn:  Roberto  B,  da  Costa 

Jot  ms  Hopkins  U 
Charles  aid  J4th  St. 

Baltimore  13,  Md. 

1  Attn:  Librarian,  Carlyle  Barter  Lab. 

Joans  Hopkins  U 
oLSl  Georgl?  Avt*. 

Silver  3pr:*  Ml. 

1  Attn:  it.  Jnoksy 
1  Mr.  A.  w.  irry,  Anglic .1 

Pn  L<n;  jAb. 

Lit.field  Research  I*  .'t  Itui.e 
McMii.nvlllo,  Ore. 

1  Attr  :  G.  4.  !i<*kuk,  Dire 'tor 

Ma*-qne**.  Lvetv:\, 

College  of  Engx'.our i 
11.1.'  W.  WlSC0..3i!  A  Vo. 

Milwaukee  >,  Wir, 

1  Att;:  A.  C.  Muoilcr,  Ei  Dei  t. 

M  I  T 

Cambridge  Mas.  . 

1  Attn:  her.  Lib.  of  Elec.,  Doc. 

Rm. 

1  Ml,;:-  A.  Oils,  Libn.  Rm  4-. '44, 

LIR 

1  Mr.  J.  E,  Ward,  Elec.  Gy;: . 

Lab. 


MIT 

Li r.co] :  La b  ’  r-  •  t*  try 
P.0.  Box  73 

1  Attn:  Le/ingto  .  (J,  Mas:; . 

1  Hnv/  Representative 

1  Dr.  W.  I.  Wells 

U  of  Mi  •  .igr*. 

Ar.r  Arbor,  Mi*’!;. 

1  Att-:  Dir,,  Cooler  iiee.  Labs,, 
it.  Campus 

1  Dr,  J.  E.  Howe,  Elec.  Pay;'. 

Lab. 

1  Comm.  Sci,  Lib.,  loO  Frieso 

Bldg. 
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U  of  Ml  oh  i  gar. 

Institute  of  Science  and  Technology 
P.O.  Box  tl8 
Ai.r.  Arbor,  Mich. 

1  Attr, j  Tech.  Documents  Service 
i  W.  Wolfe— IRIA-- 

U  of  Minnesota 
Institute  of  Technology 
Minneapolis  l4,  Minn. 

1  Att- ;  Prof.  A.  Van  der  Ziel, 

EE  Dept. 

U.  of  Nevada 
Collece  of  Engineer!  ;g 
He:  o,  Nev. 

1  Att-  Dr.  R.  A.  Ma'inrt,  EE  Dept. 

Nurtue’-ster-  U 
The  Dodge  Library 
Bo  to*  1,,  Man  . 

1  Att;  s  Jo, ok:  n.  L  *  ie,  Librarian 

Northwester .  u 
'■  '  Oak  to:  St, 

Eva-  tu-.  Ill. 

1  Att’  :  W.  2,  Totsi,  Aerial 
Men.  a rune  to  Lab. 

U  .r  .1  re  Dane 
S  it'.  Be*  d,  I*  i. 

-  Att;  ;  ...  a  *  :*i,  EL  Dent. 

o  .1  State  n 
r.  Niei  Avi; . 

C<  .  inb't’  10,  oiiij 

i  A*  t  :  Pr  jf.  L.  M.  Buuto,  EE  Doit. 

orej  State  *J 
Cwivailir,  Ore. 

i  Att-  :  H.  J.  Oorthuys,  EE  Dept. 

P  i‘.’te<'it.  1<*  I*  stitute 
.*  :o  .til..-  St. 

Br.  k.-ki  rt ,  3.".  Y. 

1  Att:  :  L.  Safw,  i-)1-  Dept, 

Pul  vice  .  if  I*,  tit. .to  of  Br.o.kJy: 

3r  In*  to  Cc*  ter,  Route  110 
F'-rmi-gl*  Le,  Y. 

1  Att*  :  L’br*  rln- 

P.r lue  U 
Lal'a .-ette,  I  d, 

1  Att*  j  Librar ,,  EL  Dent. 

he  «el*  or  **.ai  •  I;  stitute 

i  ru., ,  y. 

:  Att*:  Llbr-  r,.  Serial.'  Dept. 

uf  Ga.Jkatc-.owu- 
College  of  E-.gi*  eeri.-.g 
Sark-  Ux  >• ,  C*e  *  d*. 

1  Att:  :  Prof.  K.  E.  Lulvig 

St*,  ford  Korean*. a  Institute 
M*"  lo  ParK,  Calif, 
l  Att  :  Enter-,  al  He  ports,  G-037 

Syr-  cure  U 
S.meuce  10,  N.  Y, 

1  ' tt-  :  EE  Deii, 

-III  i*L"ile  U 
I*. 'tltute  of  Puysirs 
Up pc oia.  Swede  - 
1  Att-.:  Dr.  P,  A,  love 

U  of  Utah 

Salt  Lukt.  Cit,,  Utah 
J  Att-.:  K.  W.  Grow,  EE  Dept. 


U  of  Virginia 
Charlottesville,  7a, 

1  Attn:  J.  C,  Wyllie,  Aide  mar. 

Library 

U  of  Wash!*  gto; 

Seattle  7,  Wash. 

1  Attn:  A.  E,  Harriso  EE  Dept. 

Worchester  Polytechnic  Inst. 
Worehester,  Mass. 

1  Attn:  Dr.  H.  H.  Newell 

Yale  U 

New  Haven,  Conn. 

1  Attn:  Sloane  Physics  Lab. 

1  EE  Dept, 

1  Dunham  Lab.,  Eugr.  Library 


INDUS  lb  lEC 

Argo* ■ e  National  Lab. 

9700  South  Ca,  r 

Argc-.i.e,  Ill. 

1  Att;.:  Dr.  0.  C.  Simpson 

Admiral  Corp. 

3-300  Cortla-.d  St. 

Chicago  L7,  in. 

1  Attn:  E.  II.  Roberson,  Librarian 

Airborne  Instruments  Lab. 

Comae  Road 

Deer  Park,  Long  Inlaid,  N .  Y. 

1  Att-:  J,  D  *cr,  Vice-Preo. 

Te~n.  Dir. 

Anperex  Corn. 

S30  Duffy  Ave. 

Hickcville,  Long  Island,  M.  Y. 

1  Attn:  Pro.}.  Enginneer,  S.  Barbacso 

Autonetioc 

Div.  of  North  American  Aviation,  Inc 
91^0  E.  Inperi-tl  Highway 
Downey,  C*ilif. 

1  Attn:  Tech.  Library  j040-3 

Bell  Telepho:  e  Lab.- , 

Murray  Hill  Lab. 

Murray  Hill,  j, 

1  Att*  :  Dr.  J.  }<.  Pierce 
1  Dr.  S.  Darlington 

1  Mr.  A.  J.  Groce  nr.: 

Bell  Telephone  Labs.,  Inc. 

Technical  Information  Library 
Whlppar.y,  JJ.  J. 

1  Attn:  Te-h,  Reptn.  Librn., 

Whippan,-  Lab. 

♦Central  Electronics  Engineering 
Research  Institute 
Pilai.i,  Rajasthan,  India 
1  Attn:  Om  P.  Gandhi  -  Via: 

ONR/Londoi. 

Columbia  Radiatio*:  Lab. 

>3;3  West  120th  Ct, 

1  New  Yurk,  New  York 

Convair  «  San  Diego 

Div.  of  General  D, y-.ami.es  Corp. 

Gar  Diego  IS,  Calif, 

1  Attr  :  E*  gineeri.ng  Library 

Cook  Research  labs. 
t_-Loi  W.  Oaktor.  St. 

1  Attn:  Morton  Grove,  Ill, 


Cornell  Aeronautical  Labs,,  Inc. 

4^55  Genes s e  St. 

Buffalo  21,  N.  Y. 

1  Attr.:  Library 

Eitel-McCullough,  Inc. 

301  Industrial  Way 
San  Carlos,  Calif. 

1  Attn:  Research  Librarian 

Evan  Knight  Corp. 

East  Natick,  Mass. 

1  Attn:  Library 

Fairchild  Semiconductor  Corp. 

‘-tCOl  Junipero  Serro.  Blvd. 

Palo  Alto,  Calif. 

1  Attn:  Dr.  V..  H„  Grinich 

General  Electric  Co, 

Defense  Electro;  les  Div,,  DIED 
Cornell  University,  Ithaca,  N.  Y. 

1  Attn:  Library  -  Via:  Commander, 

A3D  W-P  AFB,  Ohio,  ASRNGW 
D.E,  Lewis 

General  Electric  TWT  Products  Sec. 
<01  California  Ave. 

Palo  Alto,  Calif. 

1  Attn:  Tech,  Library,  C,  G,  Lob 

General  Electric  Co.  Res.  Lab 
P.O.  Box  lOW 
Sc  he  r<e  e tady ,  N . Y . 

I  A*tn:  Dr.  p.  M.  lewis 
1  R.  L.  Shuey,  Mgr.  Info. 

Stud  it?  s  Sec. 

General  Electric  Co. 

Electronics  Park 
Bldg.  3,  Rm.  1., 3-1 
Syracuse,  N.Y. 

i  Attn:  Do-*.  Library.  Y.  Burke 

GllfilLan  Brother:* 

1019  Venice  Blvd. 

Lor  Angeles,  Calif. 

1  Attn:  Engr,  Library 

The  Hallicrufterr  Co. 

7  th  and  Kortner  Ave. 

1  Attr.:  Clilcugo  .  4,  Ill. 

Hewlett-Packard  Co. 

1701  Page  Mill  Koad 
1  Attn:  Palo  Alto,  Calif. 

Hughes  Aircraft 
Malibu  Beach,  Calif, 

1  Attn:  Mr.  Ians 

Hughes  Aircraft  Ct. 

Flore;  ce  at  Teale  St. 

Culver  City,  Calif. 

1  Attn:  Tech.  Doc.  Cen.,  Bldg.  6, 

Rm,  Cl0<k5 

Hughes  Aircraft  Co. 

P.O.  Box  >y{6 
Newport  Beach,  Calif. 

1  Attn:  Library,  Semiconductor  Div. 

IBM,  Box  390,  Boardmar  Road 
Poughkeepsie,  N.  Y. 

1  Attn:  J.  C.  Logue,  Data  Systems  Div. 

IBM,  Poughkeepsie,  N.  Y, 

1  Attn;  Product  Dev,  Lab.,  E.  M. 

Davis 


Cla.  .  ii*L  d  hoj or  n. 
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IBM  ADD  and  Research  Library 
Monterey  and  Cottle  Roads 
Gar.  Joey,  Calif, 

1  Attn;  Miss  M,  Griffin,  Bldg.  025 

ITT  Federal  lAbs. 

500  Washington  Ave, 

Nunley  10,  N.  J. 

1  Attn;  Mr.  E.  Mount,  Librarian 

Laboratory  for  Electronics,  Inc. 
1075  Commonwealth  Ave. 

Botton  15,  Mars, 

1  Attni  Library 

LEL,  Inc. 

75  Akron  St. 

Co plague.  Long  Island,  H,  Y. 

1  Attn;  Mr,  R.  S.  Mmr  '\>.r 

Lenkurt  Electric  Co, 

3a-  Carlos,  Calif. 

1  Attt  :  M.  L.  Waller,  Librarian 

Lib ran nope 

Div,  of  Ge:  eral  Precision,  Inc, 

;06  Wester1  Ave,. 

Glendale  1,  Calif. 

1  Attn:  Er.gr.  Library 

Lockheed  Missiles  and  Space  Div. 
P.0.  Box  -jOhf  Bldg.  W  • 

Sunnyvale,  Calif, 

1  Att-  ;  Dr.  W.  M  Harris,  Dept.  07.30 
1  G.  W,  Price,  Dept,  <>7-33 


Nortroriice 

Palo  Verde e  Research  Park 

6101  Crest  Road 

Paloe  Verdes  Estates,  Calif. 

1  Attn;  Tech.  Info.  Center 

Pacific  Semiconductors,  Inc. 

14^20  So.  Aviation  Blvd. 

Lavndale,  Calif. 

1  Attn;  H,  Q.  North 

Philco  Corp, 

Tech.  Rep,  Division 
P.0.  Box  4730 
Philadelphia  34,  Pa, 

1  Attn:  F.  R.  Sherman,  Mgr.  Editor 

Philco  Corp. 

Jolly  and  Unto:  Meeting  Roads 
Blue  Bell,  Fa. 

1  Attn:  C.  T.  McCoy 
1  Dr.  J,  R#  Feldmeier 

Polarad  Eloctroi  ior  Corp, 

43-20  Thirty -Fourth  St. 

Long  Island  City  1,  N„  Y. 

1  Attn:  A.  I!,  Sotr.onschein 

Radio  Corp,  of  America 

RCA  Labe,,  David  Sarnoff  Res,  Cen. 

Princeton,.  N.  J, 

2  Attn;  Dr,.  J,  Sklansky 

RCA  Labs,,  Princeton,  N,  J, 

1  Attn:  II,  Joht  Bo* 


Melpar,  Inc, 

3000  Arlington  Blvd. 

Falls  Cnurcu,  Va. 

1  Attn:  Librariv 

Microwave  Associates,  Inc. 

Northwest  Industrial  Park 
Burlington,  Mac  a . 

1  Att  1  ;  K.  Mnrtei.son 
1  Librarian 

Microwave  Electronics  Corp, 

*'tOi  1  I'm  sport  St. 

Palo  Ait*.,  Calif. 

1  Att 1 :  C,  F,  Kaisel 
1  M.  C.  Long 

Minneapolis -Honey veil  Regulator  Co. 

11  (7  Blue  Hero:.  Blvd, 

Riviera  Bench,  Fla. 

1  Attn:  Semiconductor  Products  Library 

Monsanto  Research  Corp. 

Station  B,.  Box  o 
Dayton  7,  Ohio 
1  Attn:  Mrs ,  D  Crabtree 

1 

Monsanto  Chemical  Co, 

800  N.  Linbergh  Blvd. 

Gt,  Louis  00,  Mo 

1  Attn:  Mr,  E,  Orban,  Mgr,  Inorganic  1 
Dev, 


RCA,  Missile  Elec,,  and  Controls  Dept,. 
Wobur.n,  Mass . 

1  Attn:  Library 

The  Rand  Corp, 

1700  Main  St, 

Santa  Monica,  Calif, 

1  Attn:  Helen  J.  Waldron,  Librarian 

Raytheon  Manufacturing  Co, 

Microwave  and  Power  Tube  Div. 
Burlington,  Mass. 

1  Attn:  Librarian,  Spencer  Lab, 

Raytheon  Mn;  ul’acturing  Cy. 

Res,  Div.,  So  Geyoa  Gt. 

Waltham,  Mass. 

1  Attn:  Dr.  H.  Stats 
1  Mrs,  M,  Bennett,  Librarian 

Roger  White  Electron  Devices,  Inc, 
Tall  Oakc  Road 

Laurel  Hedges,  Stamford,  Conn. 

Sandia  Corp. 

Sandia  Base,  Albuquerque,  N»  M. 

Attn:  Mrc,  B.  R.  Allen,  Librarian 

Sperry  Rand  Corp. 

Sperry  Electron  Tube  Div. 

Gainesville,  Fla. 


*Dlr.,  National  Physical  Lab. 
Hil3ide  hoad 
New  Delhi  12,  India 
1  Attn:  S,C.  Sharma  -  Via: 

ONR /London 

*Northern  Electric  Co.,  Ltd.
Research  and  Development  Labs.
P.O.  Box  3511,  Station  "C"

Ottawa,  Ontario,  Canada
1  Attn:  J.  F.  Tatlock

Via:  ASD,  Foreign  Release 
Office 

W-P  AFB,  Ohio 
Mr.  J.  Troyan  (ASYF)


Sperry  Gyroscope  Co. 

Div.  of  Sperry  Rand  Corp.
Great  Neck,  N.Y. 



Sperry  Gyroscope  Co. 

Engineering  Library 

Mail  Station  P-7 

Great  Neck,  Long  Island,  N.  Y.

1  Attn:  K.  Barney,  Engr.  Dept.  Head


Sperry  Microwave  Electronics 
Clearwater,  Fla.

1  Attn:  J.  E.  Pippin,  Res.  Sec.  Head
Sylvania  Electric  Products,  Inc.
206-20  Willets  Point  Blvd.

Bayside,  Long  Island,  N.  Y.

1  Attn:  L.  R.  Bloom,  Physics  Lab. 

Sylvania  Electronics  Systems 
100  First  Ave.

Waltham  54,  Mass.

1  Attn:  Librarian,  Waltham  Labs.

1  Mr.  E.  E.  Hollis 

Technical  Research  Group 
1  Syosset,  L.I.,  N.Y.

Texas  Instruments,  Inc.

Semiconductor-Components  Div.

P.O.  Box  205
Dallas  22,  Tex. 

1  Attn:  Library 

2  Dr.  W.  Adcock 

Texas  Instruments,  Inc.

13500  N.  Central  Expressway
1  Dallas,  Texas 

Texas  Instruments,  Inc.
P.O.  Box  6015
Dallas  22,  Tex.

1  Attn:  M.  E.  Chun,  Apparatus  Div.

Texas  Instruments

6017  E.  Calle  Tuberia
Phoenix,  Arizona
1  Attn:  R.  L.  Pritchard

Texas  Instruments,  Inc.

Corporate  Research  and  Engineering
Technical  Reports  Service
P.O.  Box  3474
1  Dallas  22,  Tex.

Tektronix,  Inc.

P.O.  Box  300
Beaverton,  Ore.

4  Attn:  Dr.  J.  F.  DeLord,  Dir.  of  Research

Varian  Associates
611  Hansen  Way

Palo  Alto,  Calif.

1  Attn:  Tech.  Library

Weitermann  Electronics
4549  North  38th  St.

1  Milwaukee  9,  Wisconsin

Westinghouse  Electric  Corp.
Friendship  International  Airport
Box  746,  Baltimore  3,  Md.

1  Attn:  G.  R.  Kilgore,  Mgr.  Appl.

Res.  Dept.,  Baltimore  Lab.

Westinghouse  Electric  Corp.

3  Gateway  Center 
Pittsburgh  22,  Pa. 

1  Attn:  Dr.  G.  C.  Sziklai

Westinghouse  Electric  Corp.

P.O.  Box  264 
Elmira,  N.  Y. 

1  Attn:  G.  S.  King 


Zenith  Radio  Corp.
6001  Dickens  Ave.
Chicago  39,  Ill.

1  Attn:  J.  Markin
No  AF  or  Classified  Reports


